2 The Normal (and Related) Distributions

2.1 Motivation

In this chapter, we illustrate probability and statistical inference concepts utilizing the normal distribution. The normal, also known in the physical sciences as the Gaussian distribution and, colloquially, as the “bell curve,” is the most often utilized probability distribution in data analyses, for a number of reasons.

We often observe that the data that we sample in many experiments are at least approximately normally distributed. And while other, more general families of distributions (such as the gamma distribution) might explain data just as well as the normal, the normal has the advantage of having intuitively easy-to-understand parameters: $\mu$ for the mean, and $\sigma^2$ for the variance (meaning that $\sigma$ directly indicates the “width” of the region on the real-number line over which $f_X(x)$ is effectively non-zero).
The normal distribution is the limiting distibution for many other distributions (i.e., there are many families of distributions that, with the right choice(s) for parameter value(s), can mimic the shape of the normal.
The normal distribution figures prominently in the Central Limit Theorem, which states that the sample mean of a sufficiently large sample of data, from any distribution with finite variance, is approximately normally distributed.

2.2 Probability Density Function

Recall: a probability density function is one way to represent a continuous probablity distribution, and it has the properties (a) $f_X(x) \geq 0$ and (b) $\int_x f_X(x) dx = 1$, where the integral is over all values of $x$ in the distribution’s domain.

Let $X$ be a continuous random variable sampled from a normal distribution with mean $\mu$ and variance $\sigma^2$: $X \sim \mathcal{N}(\mu,\sigma^2)$. The pdf for $X$ is \[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) ~~~~~~ x \in (-\infty,\infty) \,. \] A first thing we notice about this pdf is that it is symmetric around $\mu$. A second thing that we notice is that the integral under this curve approaches 1 over the range $\mu - 3\sigma \leq x \leq \mu + 3\sigma$. (The value of the integral of $f_X(x)$ over this range is 0.9973. This gives rise to one aspect of the so-called empirical rule: if data are approximately normally distributed, we expect all of the data to lie with three standard deviations of the sample mean, with rare exceptions.) See Figure 2.1.

$\label{fig:normpdf}A normal probability density function with mean 0 and standard deviation 1.$

Figure 2.1: A normal probability density function with mean 0 and standard deviation 1.

Recall: the expected value of a continuously distributed random variable is \[ E[X] = \int_x x f_X(x) dx\,, \] where the integral is over all values of $x$ within the domain of the pdf $f_X(x)$. The expected value is equivalent to a weighted average, with the weight for each possible value of $x$ given by $f_X(x)$.

The expected value of $X$ is \[ E[X] = \int_{-\infty}^\infty x \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx \] We implement a variable substitution to evaluate this integral. Recall that the three steps of variable substitution are…

to write down a viable substitution $u = g(x)$;
to then derive $du = h(u,x) dx$; and finally
to use $u = g(x)$ to transform the bounds of the integral.

For this particular integral: \[ (1) ~~ u = \frac{x-\mu}{\sigma} ~~ \implies ~~ (2) ~~ du = \frac{1}{\sigma}dx \] and \[ (3) ~~ x = -\infty ~\implies~ u = -\infty ~~~ \mbox{and} ~~~ x = \infty ~\implies~ u = \infty \,, \] Thus \[\begin{align*} E[X] &= \int_{-\infty}^\infty (\sigma u + \mu) \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{u^2}{2}\right) \sigma du \\ &= \int_{-\infty}^\infty \frac{\sigma u}{\sqrt{2 \pi}} \exp\left(-\frac{u^2}{2}\right) du + \int_{-\infty}^\infty \frac{\mu}{\sqrt{2 \pi}} \exp\left(-\frac{u^2}{2}\right) du \end{align*}\] The first integrand is the product of an odd function ($u$) and an even function ($\exp(-u^2/2)$), and thus the first integral evaluates to zero. The second integral is \[ E[X] = \mu \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{u^2}{2}\right) du = \mu \,, \] since the integrand is a normal pdf (with parameters $\mu = 0$ and $\sigma^2$ = 1) and the bounds of the integral match that of the domain of the normal pdf, making the value of the integral 1.

2.2.1 Variance of a Normal Probability Density Function

Recall: the variance of a continuously distributed random variable is \[ V[X] = \int_x (x-\mu)^2 f_X(x) dx = E[X^2] - (E[X])^2\,, \] where the integral is over all values of $x$ within the domain of the pdf $f_X(x)$. The variance represents the square of the “width” of a probability density function, where by “width” we mean the range of values of $x$ for which $f_X(x)$ is effectively non-zero.

As seen above, we want to compute $E[X^2] - (E[X])^2$. We have already computed $E[X]$: it is equal to $\mu$. Here, we compute $E[X^2]$.

The variable substitution setup is entirely the same, except that now \[\begin{align*} E[X^2] &= \int_{-\infty}^\infty (\sigma t + \mu)^2 \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{t^2}{2}\right) \sigma dt \\ &= \int_{-\infty}^\infty \frac{\sigma^2 t^2}{\sqrt{2 \pi}} \exp\left(-\frac{t^2}{2}\right) dt + \int_{-\infty}^\infty \frac{2 \mu \sigma t}{\sqrt{2 \pi}} \exp\left(-\frac{t^2}{2}\right) dt + \int_{-\infty}^\infty \frac{\mu^2}{\sqrt{2 \pi}} \exp\left(-\frac{t^2}{2}\right) dt \,. \end{align*}\] The second integral is that of an odd function and thus evaluates to zero, and the third integral is, given results above, $\mu^2$. This leaves \[ E[X^2] = \int_{-\infty}^\infty \frac{\sigma^2 t^2}{\sqrt{2 \pi}} \exp\left(-\frac{t^2}{2}\right) dt + \mu^2 \,. \] We note that if we were to apply the shortcut formula, the $\mu^2$ immediately above would cancel with $(E[X])^2 = \mu^2$, so now we can say that \[ V[X] = \frac{\sigma^2}{\sqrt{2 \pi}} \int_{-\infty}^\infty t^2 \exp\left(-\frac{t^2}{2}\right) dt \,. \] This requires integration by parts. Setting aside the constants outside the integral, we have that \[ \begin{array}{ll} u = -t & dv = -t \exp(-t^2/2) dt \\ du = -dt & v = \exp(-t^2/2) \end{array} \,, \] and so \[\begin{align*} \int_{-\infty}^\infty t^2 \exp\left(-\frac{t^2}{2}\right) dt &= \left. uv \right|_{-\infty}^{\infty} - \int_{-\infty}^\infty v du \\ &= - \left. t \exp(-t^2/2) \right|_{-\infty}^{\infty} + \int_{-\infty}^\infty \exp(-t^2/2) dt \\ &= 0 + \sqrt{2\pi} \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp(-t^2/2) dt \\ &= \sqrt{2\pi} \,. \end{align*}\] The first term evaluates to zero between for each bound, as $e^{-t^2/2}$ goes to zero much faster than $\vert t \vert$ goes to infinity. As for the second term: we recognize that the integrand is a normal pdf with mean zero and variance one…so the integral by definition evaluates to 1. In the end, after restoring the constants we set aside above, we have that \[ V[X] = \frac{\sigma^2}{\sqrt{2 \pi}} \sqrt{2\pi} = \sigma^2 \,. \]

2.2.2 Skewness of the Normal Probability Density Function

The skewness of a pmf or pdf is a metric that indicates its level of asymmetry. Fisher’s moment coefficient of skewness is \[ E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] \,. \] which, expanded out, becomes \[ \frac{E[X^3] - 3 \mu \sigma^2 - \mu^3}{\sigma^3} \,. \] What is the skewness of the normal distribution? At first, this appears to require a long and involved series of integrations so as to solve $E[X^3]$. But let’s try to be more clever.

We know, from above, that the quantity $(\sigma t + \mu)^3$ will appear in the integral for $E[X^3]$. Let’s expand this out: \[ \sigma^3 t^3 + 3 \sigma^2 \mu t^2 + 3 \sigma \mu^2 t + \mu^3 \,. \] Each of these terms will appear separately in integrals of $\exp(-t^2/2)/\sqrt{2\pi}$ over the domain $-\infty$ to $\infty$. From above, what do we already know? First, we know that if $t$ is raised to an odd power, the integral will be zero. This eliminates $\sigma^3 t^3$ and $3 \sigma \mu^2 t$. Second, we know that for the $t^0$ term, the result will be $\mu^3$, and we know that for the $t^2$ term, the result will be $3 \sigma^2 \mu$. Thus \[ E[X^3] = 3\sigma^2\mu + \mu^3 \] and the skewness is \[ \frac{3 \mu \sigma^2 + \mu^3 - 3 \mu \sigma^2 - \mu^3}{\sigma^3} = 0 \,. \] The skewness is zero, meaning that a normal pdf is symmetric around its mean $\mu$.

2.2.3 Computing Probabilities

Before diving into this example, we will be clear that this is not the optimal way to compute probabilities associated with normal random variables, as we should utilize R’s pnorm() function, as shown in the next section. However, showing how to utilize integrate() is useful review. (We also show how to pass distribution parameters to integrate(), which we did not do in the last chapter.)

If $X \sim \mathcal{N}(10,4)$, what is $P(8 \leq X \leq 13.5)$?

To find this probability, we integrate over the normal pdf with mean $\mu = 10$ and variance $\sigma^2 = 4$. Visually, we are determining the area of the red-shaded region in Figure 2.2.

$\label{fig:pdfint}The probability $P(8 \leq X \leq 13.5)$ is the area of the red-shaded region, i.e., the integral of the normal probability density function from 8 to 13.5, assuming $\mu = 10$ and $\sigma^2 = 4$.$

Figure 2.2: The probability $P(8 \leq X \leq 13.5)$ is the area of the red-shaded region, i.e., the integral of the normal probability density function from 8 to 13.5, assuming $\mu = 10$ and $\sigma^2 = 4$.

integrand <- function(x,mu,sigma2) {
  return(exp(-(x-mu)^2/2/sigma2)/sqrt(2*pi*sigma2))
}
integrate(integrand,8,13.5,mu=10,sigma2=4)$value

## [1] 0.8012856

The result is 0.801. Note how the integrand() function requires extra information above and beyond the value of x. This information is passed in using the extra arguments mu and sigma2, the values of which are set in the call to integrate().

If $X \sim \mathcal{N}(13,5)$, what is $P(8 \leq X \leq 13.5 \vert X < 14)$?

Recall that $P(a \leq X \leq b \vert X \leq c) = P(a \leq X \leq b)/P(X \leq c)$, assuming that $a < b < c$. So this probability is, visually, the ratio of the area of the brown-shaded region in Figure 2.3 to the area of the red-shaded region underlying it. We can reuse the integrand() function from above, calling it twice:

integrate(integrand,8,13.5,mu=13,sigma2=5)$value / 
  integrate(integrand,-Inf,14,mu=13,sigma2=5)$value

## [1] 0.8560226

The result is 0.856. Note how we use -Inf in the second call to integrate(): Inf, like pi, is a reserved word in the R programming language.

$\label{fig:pdfcond}The conditional probability $P(8 \leq X \leq 13.5 \vert X < 14)$ is the ratio of the area of the brown-shaded region to the area of the red-shaded region underlying the brown-shaded region.$

Figure 2.3: The conditional probability $P(8 \leq X \leq 13.5 \vert X < 14)$ is the ratio of the area of the brown-shaded region to the area of the red-shaded region underlying the brown-shaded region.

2.3 Cumulative Distribution Function

Recall: the cumulative distribution function, or cdf, is another means by which to encapsulate information about a probability distribution. For a continuous distribution, it is defined as $F_X(x) = \int_{y \leq x} f_Y(y) dy$, and it is defined for all values $x \in (-\infty,\infty)$, with $F_X(-\infty) = 0$ and $F_X(\infty) = 1$.

The cdf for the normal distribution is the “accumulated probability” between $-\infty$ and the functional input $x$: \[ F_X(x) = \int_{-\infty}^x f_Y(y) dy = \int_{-\infty}^x \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) dy \,. \] (See Figure 2.4.) Recall that because $x$ is the upper bound of the integral, we have to replace $x$ with some other variable in the integrand itself. (Here we choose $y$. The choice is arbitrary.)

$\label{fig:normcdf1}The cdf is the area of the red-shaded polygon.$

Figure 2.4: The cdf is the area of the red-shaded polygon.

To evaluate this integral, we again implement a variable substitution strategy: \[ (1) ~~ t = \frac{(y-\mu)}{\sqrt{2}\sigma} ~~\implies~~ (2) ~~ dt = \frac{1}{\sqrt{2}\sigma}dy \] and \[ (3) ~~ y = -\infty ~\implies~ t = -\infty ~~~ \mbox{and} ~~~ y = x ~\implies~ t = \frac{(x-\mu)}{\sqrt{2}\sigma} \,, \] so \[\begin{align*} F_X(x) &= \int_{-\infty}^x \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) dy \\ &= \int_{-\infty}^{\frac{x-\mu}{\sqrt{2}\sigma}} \frac{1}{\sqrt{2\pi\sigma^2}} \exp(-t^2) \sqrt{2}\sigma dt = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\frac{x-\mu}{\sqrt{2}\sigma}} \exp(-t^2) dt \,. \end{align*}\] The error function is defined as \[ \mbox{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z \exp(-t^2)dt \,, \] which is close to, but not quite, our expression for $F_X(x)$. (The integrand is the same, but the bounds of integration differ.) To match the bounds: \[\begin{align*} \frac{2}{\sqrt{\pi}} \int_{-\infty}^z \exp(-t^2)dt &= \frac{2}{\sqrt{\pi}} \left( \int_{-\infty}^0 \exp(-t^2)dt + \int_0^z \exp(-t^2)dt \right) \\ &= \frac{2}{\sqrt{\pi}} \left( -\int_0^{-\infty} \exp(-t^2)dt + \int_0^z \exp(-t^2)dt \right) \\ &= -\mbox{erf}(-\infty) + \mbox{erf}(z) = \mbox{erf}(\infty) + \mbox{erf}(z) = 1 + \mbox{erf}(z) \,. \end{align*}\] Here we make use of two properties of the error function: $\mbox{erf}(-z) = -\mbox{erf}(z)$, and $\mbox{erf}(\infty)=1$. Thus \[ \frac{1}{\sqrt{\pi}} \int_{-\infty}^z \exp(-t^2)dt = \frac{1}{2}[1 + \mbox{erf}(z)] \,. \] By matching this expression with that given for $F(x)$ above, we see that \[ F_X(x) = \frac{1}{2}\left[1 + \mbox{erf}\left(\frac{x-\mu}{\sqrt{2}\sigma}\right)\right] \,. \] In Figure 2.5, we plot $F_X(x)$. We note that while the error function is available to use directly in some R packages, it is provided only indirectly in base-R, via the pnorm() function. (In general, one computes cdf values for probability distributions in R using functions prefixed with p: pnorm(), pbinom(), punif(), etc.) Examining this figure, we see that this cdf abides by the properties listed in the previous chapter: $F(-\infty) = 0$ and $F(\infty) = 1$, and it is (strictly) monotonically increasing.

$\label{fig:normcdf2}The cdf for a normal distribution with mean 0 and standard deviation 1.$

Figure 2.5: The cdf for a normal distribution with mean 0 and standard deviation 1.

To compute the probability $P(a < X < b)$, we make use of the cdf. (As a reminder: if we have the cdf at our disposal and need to compute a probability…we should use it!) \[\begin{align*} P(a < X < b) &= P(X < b) - P(X < a) \\ &= F(b) - F(a) = \frac{1}{2}\left[ \mbox{erf}\left(\frac{b-\mu}{\sqrt{2}\sigma}\right) - \mbox{erf}\left(\frac{a-\mu}{\sqrt{2}\sigma}\right)\right] \,, \end{align*}\] which is more simply rendered in R as

pnorm(b,mean=mu,sd=sigma) - pnorm(a,mean=mu,sd=sigma)

Let’s look again at Figure 2.4. What if, instead of asking the question “what is the shaded area under the curve,” which is answered by integrating the normal pdf from $-\infty$ to a specified coordinate $x$, we instead ask the question “what value of $x$ is associated with a given area under the curve”? This is the inverse problem, one that we can solve so long as the relationship between $x$ and $F(x)$ is bijective (i.e., one-to-one): $x = F_X^{-1}[F_X(x)]$ for all $x$.

Recall: an inverse cdf function $F_X^{-1}(q)$ takes as input a distribution quantile $q \in [0,1]$ and returns the value of $x$ such that $q = F_X(x)$.

Let $q = F_X(x)$. Then, for the case of the normal cdf, we can write that \[ x = \sqrt{2} \sigma~ \mbox{erf}^{-1}\left(2q-1\right) + \mu \,, \] where $\mbox{erf}^{-1}(\cdot)$ is the inverse error function. Like the error function itself, the inverse error function is available for use in some R packages, but it is most commonly accessed, indirectly via the base-R function qnorm(). (In general, one computes inverse cdf values for probability distributions in R using functions prefixed with q: qnorm(), qpois(), qbeta(), etc.)

2.3.1 Computing Probabilities

We utilize the two examples provided in the last section to show how one would compute probabilities associated with the normal pdf by hand. But we note that a computer is still needed to derive the final numbers!

If $X \sim \mathcal{N}(10,4)$, what is $P(8 \leq X \leq 13.5)$?

We have that \[ P(8 \leq X \leq 13.5) = P(X \leq 13.5) - P(X \leq 8) = F_X(13.5 \vert 10,4) - F_X(8 \vert 10,4) \,. \] Well, this is where analytic computation ends, as we cannot evaluate the cdfs using pen and paper. Instead, we utilize the function pnorm():

pnorm(13.5,mean=10,sd=sqrt(4)) - pnorm(8,mean=10,sd=sqrt(4))

## [1] 0.8012856

We get the same result as we got in the last section: 0.801.

If $X \sim \mathcal{N}(13,5)$, what is $P(8 \leq X \leq 13.5 \vert X < 14)$?

Knowing that we cannot solve this by hand, we go directly to R:

(pnorm(13.5,mean=13,sd=sqrt(5)) - pnorm(8,mean=13,sd=sqrt(5))) / 
  pnorm(14,mean=13,sd=sqrt(5))

## [1] 0.8560226

The answer, as before, is 0.856.

But let’s go ahead and add some complexity here. How would we answer the following question?

If $\mu = 20$ and $\sigma^2 = 3$, what is the value of $a$ such that $P(20 - a \leq X \leq 20 + a) = 0.48$?

(The first thing to thing about here is: does the value of $\mu$ actually matter? The answer here is no. Think about why that may be.)

We have that \[\begin{align*} P(20-a \leq X \leq 20+a) = 0.48 &= P(X \leq 20+a) - P(X \leq 20-a)\\ &= F_X(20+a \vert 20,3) - F_X(20-a \vert 20,3) \,. \end{align*}\] Hmm…we’re stuck. We cannot invert this equation so as to isolate $a$…or can we? Remember that a normal pdf is symmetric around the mean. Hence \[ P(X \leq 20+a) = 1 - P(X > 20+a) = 1 - P(X \leq 20-a) \,. \] The area under the normal pdf from $20+a$ to $\infty$ is the same as the area from $-\infty$ to $20-a$. So… \[\begin{align*} P(20-a \leq X \leq 20+a) &= 1 - F_X(20-a \vert 20,3) - F_X(20-a \vert 20,3)\\ &= 1 - 2F_X(20-a \vert 20,3) \,, \end{align*}\] or \[ F_X(20-a \vert 20,3) = \frac{1}{2}[1 - P(20-a \leq X \leq 20+a)] = \frac{1}{2} 0.52 = 0.26 \,. \] Now, we invert the cdf and rearrange terms to get: \[ a = 20 - F_X^{-1}(0.26 \vert 20,3) \,, \] and once again, we transition to R:

a <- 20 - qnorm(0.26,mean=20,sd=sqrt(3))

The result is $a = 1.114$, i.e., $P(18.886 \leq X \leq 21.114) = 0.48$.

The reader might quibble with our assertion that $\mu$ doesn’t matter here, because, after all, the value of $\mu$ appears in the final code. However, if we changed both values of 20 to some other, arbitrary value, we would derive the same value for $a$.

2.3.2 Visualizing a Cumulative Distribution Function

Let’s say we want to make a figure like that in Figure 2.4, where we are given that our normal distribution has mean $\mu = 10$ and variance $\sigma^2 = 4$, and we want to overlay the shaded region associated with $F_X(9)$. How do we do this?

The first step is to determine the population standard deviation. Here, that would be $\sigma = \sqrt{4} = 2$.

The second step is to determine an appropriate range of values of $x$ over which to plot the cdf. Here we will assume that $\mu - 4\sigma$ to $\mu + 4\sigma$ is sufficient; this corresponds to $x = 2$ to $x = 18$.

The third step is to code. The result of our coding is shown in Figure 2.6.

x         <- seq(2,18,by=0.05)       # vector with x = {2,2.05,2.1,...,18}
f.x       <- dnorm(x,mean=10,sd=2)   # compute f(x) for all x
df        <- data.frame(x=x,f.x=f.x) # define a dataframe with above data
df.shaded <- subset(df,x<=9)         # get subset of data with x values <= 9

ggplot(data=df,aes(x=x,y=f.x)) +
  geom_line(col="blue",lwd=1) +     
  geom_area(data=df.shaded,fill="red",col="blue",outline.type="full") +
  geom_hline(yintercept=0,lwd=1) +
  labs(y = expression(f[X]*"(x)")) +
  theme(axis.title=element_text(size = rel(1.25)))

$\label{fig:normviz}The cumulative distribution function is the area of the red-shaded region.$

Figure 2.6: The cumulative distribution function is the area of the red-shaded region.

For the polygon, the first $(x,y)$ pair is $(9,0)$, the second is $(2,0)$ (where $x = 2$ is the lower limit of the plot…there is no need to take the polygon further “to the left”), and the next set are $(x,f_X(x))$ for all $x$ values $\leq 9$. The last point is the first point: this closes the polygon.

2.4 Moment-Generating Functions

In the previous chapter, we learned that if we linearly transform a random variable, i.e., if we define a new random variable $Y = \sum_{i=1}^n a_i X_i + b$, where $\{a_1,\ldots,a_n\}$ and $b$ are constants, then \[ E[Y] = E\left[\sum_{i=1}^n a_i X_i + b\right] = b + \sum_{i=1}^n a_i E[X_i] \,. \] Furthermore, if we assume that the $X_i$’s are independent random variables, then the variance of $Y$ is \[ V[Y] = V\left[\sum_{i=1}^n a_i X_i + b\right] = \sum_{i=1}^n a_i^2 V[X_i] \,. \] (Because the $X_i$’s are independent, we do not need to take into account any covariance, or linear dependence, between the $X_i$’s, simplifying the equation for $V[Y]$. We discuss how covariance is taken into account in Chapter 6.) So we know where $Y$ is centered and we know how “wide” it is. However, we don’t yet know the shape of the distribution for $Y$, i.e., we don’t yet know $f_Y(y)$. To show how we might derive the distribution, we introduce a new concept, that of the moment-generating function, or mgf.

In the previous chapter, we introduced the concept of distribution moments, i.e., the expected values $E[X^k]$ and $E[(X-\mu)^k]$. The moments of a given probability distribution are unique, and can often (but not always) be neatly encapsulated in a single mathematical expression: the moment-generating function (or mgf). To derive an mgf for a given distribution, we invoke the Law of the Unconscious Statistician: \[\begin{align*} m_X(t) = E[e^{tX}] &= \int_x e^{tX} f_X(x) \\ &= \int_x \left[1 + tx + \frac{t^2}{2!}x^2 + \cdots \right] f_X(x) \\ &= \int_x \left[ f_X(x) + t x f_X(x) + \frac{t^2}{2!} x^2 f_X(x) + \cdots \right] \\ &= \int_x f_X(x) + t \int_x x f_X(x) + \frac{t^2}{2!} \int_x x^2 f_X(x) + \cdots \\ &= 1 + t E[X] + \frac{t^2}{2!} E[X^2] + \cdots \,. \end{align*}\] An mgf only exists for a particular distribution if there is a constant $b$ such that $m_X(t)$ is finite for $\vert t \vert < b$. (This is a detail that we will not concern ourselves with here, i.e., we will assume that the expressions we derive for $E[e^{tX}]$ are valid expressions.) An example of a distribution for which the mgf does not exist is the Cauchy distribution.

Moment-generating functions are called such because, as one might guess, they generate moments (via differentiation): \[ \left.\frac{d^k m_X(t)}{dt^k}\right|_{t=0} = \frac{d^k}{dt^k} \left. \left[1 + t E[X] + \frac{t^2}{2!} E[X^2] + \cdots \right]\right|_{t=0} = \left. \left[E[X^k] + tE[X^{k+1}] + \cdots\right] \right|_{t=0} = E[X^k] \,. \] The $k^{\rm th}$ derivative of an mgf, with $t$ set to zero, yields the $k^{\rm th}$ moment $E[X^k]$.

But, the reader may ask, why are mgfs important here, in the context of normal random variables? We already know all the moments of this distribution that we care to know: they are written down. The answer is that a remarkably useful property of mgfs is the following: if $Y = b + \sum_{i=1}^n a_i X_i$, where the $X_i$’s are independent random variables sampled from a distribution whose mgf exists, then \[ m_Y(t) = e^{bt} m_{X_1}(a_1 t) \cdot m_{X_2}(a_2 t) \cdots m_{X_n}(a_n t) = e^{bt} \prod_{i=1}^n m_{X_i}(a_i t) \,, \] where $m_{X_i}(\cdot)$ is the moment-generating function for the random variable $X_i$. An mgf is typically written as a function of $t$; the notation $m_{X_i}(a_it)$ simply means that when we evaluate the above equation, wherever there is a $t$, we plug in $a_it$. Here’s the key point: if we recognize the form of $m_Y(t)$ as matching that of the mgf for a given distribution, then we know the distribution for $Y$.

Below, we show how the method of moment-generating functions allows us to derive distributions for linearly transformed normal random variables.

2.4.1 Moment-Generating Function for a Probability Mass Function

Let’s assume that we sample data from the following pmf:

$x$	$p_X(x)$
0	0.2
1	0.3
2	0.5

What is the mgf for this distribution? What is its expected value?

To derive the mgf, we compute $E[e^{tX}]$: \[ m_X(t) = E[e^{tX}] = \sum_x e^{tx} p_X(x) = 1 \cdot 0.2 + e^t \cdot 0.3 + e^{2t} \cdot 0.5 = 0.2 + 0.3 e^t + 0.5 e^{2t} \,. \] This cannot be simplified, and thus this is the final answer. As far as the expected value is concerned: we could simply compute $E[X] = \sum_x x p_X(x)$, but since we have the mgf now, we can use it too: \[ E[X] = \left.\frac{d}{dt}m_X(t)\right|_{t=0} = \left. (0.3 e^t + e^{2t})\right|_{t=0} = 0.3 + 1 = 1.3 \,. \]

2.4.2 Moment-Generating Function for a Probability Density Function

Let’s assume that we now sample data from the following pdf: \[ f_X(x) = \frac{1}{\theta} e^{-x/\theta} \,, \] where $x \geq 0$ and $\theta > 0$. What is the mgf of this distribution? \[\begin{align*} E[e^{tX}] &= \int_0^\infty e^{tx} \frac{1}{\theta} e^{-x/\theta} dx = \frac{1}{\theta} \int_0^\infty e^{-x\left(\frac{1}{\theta}-t\right)} dx \\ &= \frac{1}{\theta} \int_0^\infty e^{-x/\theta'} dx = \frac{\theta'}{\theta} = \frac{1}{\theta} \frac{1}{(1/\theta - t)} = \frac{1}{(1-\theta t)} = (1-\theta t)^{-1} \,. \end{align*}\] The expected value of this distribution can be computed via the integral $\int_x x f_X(x) dx$, but again, as we now have the mgf, \[ E[X] = \left.\frac{d}{dt}m_X(t)\right|_{t=0} = \left. -(1-\theta t)^{-2} (-\theta)\right|_{t=0} = \theta \,. \]

2.5 Linear Functions of Normal Random Variables

Let’s assume we are given $n$ normal random variables $\{X_1,\ldots,X_n\}$, which are independent (but not necessarily identically distributed), and we wish to determine the distribution of the sum $Y = b + \sum_{i=1}^n a_i X_i$. We recall that the expected value operator $E[X]$ and the variance operator $V[X]$ are linear operators, thus we can note immediately that…

the expected value is \[ E[Y] = E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n E[a_i X_i] = \sum_{i=1}^n a_i E[X_i] = \sum_{i=1}^n a_i \mu_i \,; \]
and the variance is \[ V[Y] = V\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n V[a_i X_i] = \sum_{i=1}^n a_i^2 V[X_i] = \sum_{i=1}^n a_i^2 \sigma_i^2 \,. \]

As far as deriving the distribution: the mgf for the normal is \[ m_X(t) = \exp\left(\mu t + \sigma^2\frac{t^2}{2} \right) \,, \] and thus if we have $n$ independent normal random variables, we find that \[\begin{align*} m_Y(t) &= \exp\left(\mu_1a_1t+a_1^2\sigma_1^2\frac{t^2}{2}\right) \cdot \cdots \cdot \exp\left(\mu_na_nt+a_n^2\sigma_n^2\frac{t^2}{2}\right) \\ &= \exp\left[ (a_1\mu_1+\cdots+a_n\mu_n)t + \left(a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2\right)\frac{t^2}{2} \right] \\ &= \exp\left[ \left(\sum_{i=1}^n a_i\mu_i\right)t + \left(\sum_{i=1}^n a_i^2\sigma_i^2\right)\frac{t^2}{2} \right] \,. \end{align*}\] When we examine the result, we see immediately that it retains the functional form of a normal mgf, and thus we conclude that $Y$ itself is normally distributed, with mean $\sum_{i=1}^n a_i\mu_i$ and variance $\sum_{i=1}^n a_i^2\sigma_i^2$, i.e., \[ Y \sim \mathcal{N}\left(\sum_{i=1}^n a_i\mu_i,\sum_{i=1}^n a_i^2\sigma_i^2\right) \,. \]

2.5.1 The Distribution of the Sample Mean of iid Normal Random Variables

We have previously seen that when have a sample of $n$ iid random variables, $E[\bar{X}] = \mu$ and $V[\bar{X}] = \sigma^2/n$. Now we want to determine the distribution of $\bar{X}$, not just its mean and variance. In general, if $\bar{X} = (1/n)\sum_{i=1}^n X_i$, then \[\begin{align*} m_{\bar{X}}(t) &= m_{X_1}(a_1 t) \cdot m_{X_2}(a_2 t) \cdots m_{X_n}(a_n t) \\ &= m_{X_1}\left(\frac{t}{n}\right) \cdots m_{X_n}\left(\frac{t}{n}\right) = \prod_{i=1}^n m_{X_i}\left(\frac{t}{n}\right) \,, \end{align*}\] and thus \[\begin{align*} m_{\bar{X}}(t) &= \exp\left[ \left(\sum_{i=1}^n \frac{\mu}{n}\right)t + \left(\sum_{i=1}^n \frac{\sigma^2}{n^2}\right)\frac{t^2}{2} \right] \\ &= \exp\left( n \frac{\mu}{n} t + n \frac{\sigma^2}{n^2} \frac{t^2}{2} \right) = \exp\left(\mu t + \frac{\sigma^2}{n} \frac{t^2}{2} \right) \,. \end{align*}\] We see that $\bar{X} \sim \mathcal{N}(\mu,\sigma^2/n)$, i.e., that the sample mean observed in any given experiment is sampled from a normal distribution centered on the true mean, with a width that goes to zero as $n \rightarrow \infty$.

2.5.2 Using Variable Substitution to Determine Distribution of Y = aX + b

We set $y = ax+b$ and note that $dy = a dx$ and that if $x = \pm \infty$ then $y$ also equals $\pm \infty$. Thus \[\begin{align*} \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma^2} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] dx &= \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma^2} \exp\left[-\frac{([y-b]/a-\mu)^2}{2\sigma^2}\right] \frac{dy}{a} \\ &= \int_{-\infty}^{\infty} \frac{1}{2\pi a^2\sigma^2} \exp\left[-\frac{(y-b-a\mu)^2}{2a^2\sigma^2}\right] dy \\ &= \int_{-\infty}^{\infty} \frac{1}{2\pi a^2\sigma^2} \exp\left[-\frac{(y-[a\mu+b])^2}{2a^2\sigma^2}\right] dy \end{align*}\] The key point here is that the integrand has the functional form of a normal pdf and the integral bounds match the domain of a normal pdf…hence $Y$ is normally distributed, with mean $a\mu+b$ and variance $a^2\sigma^2$. (This key point holds in general…if we have a pmf or pdf whose functional form and domain match that of a known, “named” family of distributions, then the pmf or pdf belongs to that family.)

2.6 Standardizing a Normal Random Variable with Known Variance

To standardize any random variable, we subtract the expected value and divide by the standard deviation, i.e., we set \[ Z = \frac{X - E[X]}{\sqrt{V[X]}} \,. \] If $X \sim \mathcal{N}(\mu,\sigma^2)$, what are the mean and variance of $Z$, and can we derive $f_Z(z)$?

Expected Value. If we write $Z = aX+b$, where $a$ is $1/\sigma$ and $b = -\mu/\sigma$, then \[ E[Z] = E[aX+b] = \frac{1}{\sigma}E[X] - b = \frac{\mu}{\sigma} - \frac{\mu}{\sigma} = 0 \,. \]
Variance. The variance is \[ V[Z] = V[aX+b] = a^2V[X] = \frac{1}{\sigma^2} \sigma^2 = 1 \,. \]

OK…so far, so good: the distribution is centered at 0 and has variance 1. To derive the distribution, we use the methods of mgfs.

The mgf for a normal random variable is \[ m_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \,. \] Harkening back to the last section, the mgf for $Z = aX+b$ is \[\begin{align*} m_Z(t) &= e^{bt} m_X(at) \\ &= \exp\left(-\frac{\mu t}{\sigma}\right) \exp\left(a \mu t + \frac{a^2 \sigma^2 t^2} {2}\right) \\ &= \exp\left(-\frac{\mu t}{\sigma}\right) \exp\left(\frac{\mu t}{\sigma} + \frac{\sigma^2 t^2}{ 2 \sigma^2}\right) \\ &= \exp\left(0 t + \frac{1^2 t^2}{2}\right) \,. \end{align*}\] This mgf retains the functional form of a normal mgf, but with $\mu = 0$ and $\sigma = 1$. Thus $Z \sim \mathcal{N}(0,1)$, and thus the pdf for $Z$ is \[ f_Z(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) ~~~~ z \in (-\infty,\infty) \,, \] while the cdf for $Z$ is \[ F_Z(z) = \Phi(z) = \frac{1}{2}\left[1 + \mbox{erf}\left(\frac{z}{\sqrt{2}}\right)\right] \,. \] $Z$ is a so-called standard normal random variable. By historical convention, the cdf of the standard normal distribution is denoted $\Phi(z)$ (“phi” of z, pronounced “fye”) rather than $F_Z(z)$.

Statisticians often standardize normally distributed random variables and perform probability calculations using the standard normal. This is unnecessary in the age of computers, but in the pre-computer era standardization greatly simplified calculations since all one needed was a single table of values of $\Phi(z)$ to compute probabilities. That said, standardization has its uses. For instance, mentally computing an approximate value for $P(6 < X < 12)$ when $X \sim \mathcal{N}(9,9)$ can be quite a bit more taxing than if we were to write down the equivalent expression $P(-1 < Z < 1)$ and then evaluate that. (A skilled practitioner would know right away that the latter expression evaluates to $\approx$ 0.68.)

Common $Z$-Range/Probability Conversions
$(z_L,z_U)$	$\pm$ 1	$\pm$ 2	$\pm$ 3	$\pm$ 1.645	$\pm$ 1.960	$\pm$ 2.576
prob.	0.6837	0.9544	0.9973	0.90	0.95	0.99

2.6.1 Computing Probabilities

Here we will reexamine the three examples that we worked through above in the section in which we introduce the normal cumulative distribution function; here, we will utilize standardization. Note: the final results will all be the same!

If $X \sim \mathcal{N}(10,4)$, what is $P(8 \leq X \leq 13.5)$?

We standardize $X$: $Z = (X-\mu)/\sigma = (X-10)/2$. Hence the bounds of integration are $(8-10)/2 = -1$ and $(13.5-10)/2 = 1.75$, and the probability we seek is \[ P(-1 \leq Z \leq 1.75) = \Phi(1.75) - \Phi(-1) \,. \] To compute the final number, we utilize pnorm() with its default values of mean=0 and sd=1:

pnorm(1.75) - pnorm(-1)

## [1] 0.8012856

The probability is 0.801.

If $X \sim \mathcal{N}(13,5)$, what is $P(8 \leq X \leq 13.5 \vert X < 14)$?

The integral bounds are $(8-13)/\sqrt{5} = -\sqrt{5}$ and $(13.5-13)/\sqrt{5} = \sqrt{5}/10$ in the numerator, and $-\infty$ and $(14-13/\sqrt{5} = \sqrt{5}/5$ in the denominator. In R:

(pnorm(sqrt(5)/10) - pnorm(-sqrt(5))) / pnorm(sqrt(5)/5)

## [1] 0.8560226

The probability is 0.856.

If $\mu = 20$ and $\sigma^2 = 3$, what is the value of $a$ such that $P(20-a \leq X \leq 20+a) = 0.48$?

Here, \[ P(20-a \leq X \leq 20+a) = P\left( \frac{20-a-20}{\sqrt{3}} \leq \frac{X - 20}{\sqrt{3}} \leq \frac{20+a-20}{\sqrt{3}}\right) = P\left(-\frac{a}{\sqrt{3}} \leq Z \leq \frac{a}{\sqrt{3}} \right) \,, \] and we utilize the symmetry of the standard normal around zero to write \[ P\left(-\frac{a}{\sqrt{3}} \leq Z \leq \frac{a}{\sqrt{3}} \right) = 1 - 2P\left(Z \leq -\frac{a}{\sqrt{3}}\right) = 1 - 2\Phi\left(-\frac{a}{\sqrt{3}}\right) \,. \] Thus \[ 1 - 2\Phi\left(-\frac{a}{\sqrt{3}}\right) = 0.48 ~\Rightarrow~ \Phi\left(-\frac{a}{\sqrt{3}}\right) = 0.26 ~\Rightarrow~ -\frac{a}{\sqrt{3}} = \Phi^{-1}(0.26) = -0.64 \,, \] and thus $a = 1.114$. (Note that we numerically evaluate $\Phi^{-1}(0.26)$ using qnorm(0.26)).

2.7 General Transformations of a Single Random Variable

Thus far, when discussing transformations of a random variable, we have limited ourselves to linear transformations of independent r.v.’s, i.e., $Y = b + \sum_{i=1}^n a_i X_i$. What if instead we want to make a more general transformation of a single random variable, e.g., $Y = X^2 + 3$ or $Y = \sin X$? We cannot use the method of moment-generating functions here…we need a new method.

Let’s assume we have a random variable $X$, and we transform it according to a function $g(\cdot)$: $U = g(X)$. Then:

we identify the inverse function $X = g^{-1}(U)$;
we derive the cdf of $U$: $F_U(u) = P(U \leq u) = P(g(X) \leq u) = P(X \leq g^{-1}(u))$; and last
we derive the pdf of $U$: $f_U(u) = dF_U(u)/du$.

Recall: a continuous distribution’s pdf is the derivative of its cdf.

2.7.1 Distribution of a Transformed Random Variable

We are given the following pdf: \[ f_X(x) = 3x^2 \,, \] where $x \in [0,1]$.

What is the distribution of $U = X/3$?

We follow the three steps outlined above. First, we note that $U = X/3$ and thus $X = 3U$. Next, we find \[ F_U(u) = P(U \leq u) = P\left(\frac{X}{3} \leq u\right) = P(X \leq 3u) = \int_0^{3u} 3x^2 dx = \left. x^3 \right|_0^{3u} = 27u^3 \,, \] for $u \in [0,1/3]$. (The bounds are determined by plugging $x=0$ and $x=1$ into $u = x/3$.) Now we can derive the pdf for $U$: \[ f_U(u) = \frac{d}{du} 27u^3 = 54u^2 \,, \] for $u \in [0,1/3]$.

What is the distribution of $U = -X$?

We note that $U = -X$ and thus $X = -U$. Next, we find \[ F_U(u) = P(U \leq u) = P(-X \leq u) = P(X \geq -u) = \int_{-u}^{1} 3x^2 dx = \left. x^3 \right|_{-u}^{1} = 1 - (-u)^3 = 1 + u^3 \,, \] for $u \in [-1,0]$. (Note that the direction of the inequality changed in the probability statement because of the sign change from $-X$ to $X$, and the bounds are reversed to be in ascending order.) Now we derive the pdf: \[ f_U(u) = \frac{d}{du} (1+u^3) = 3u^2 \,, \] for $u \in [-1,0]$.

What is the distribution of $U = 2e^X$?

We note that $U = 2e^X$ and thus $X = \log(U/2)$. Next, we find \[\begin{align*} F_U(u) = P(U \leq u) &= P\left(2e^X \leq u\right) = P\left(X \leq \log\frac{u}{2}\right)\\ &= \int_0^{\log(u/2)} 3x^2 dx = \left. x^3 \right|_0^{\log(u/2)} = \left(\log\frac{u}{2}\right)^3 \,, \end{align*}\] for $u \in [2,2e]$. The pdf for $U$ is thus \[ f_U(u) = \frac{d}{du} \left(\log\frac{u}{2}\right)^3 = 3\left(\log\frac{u}{2}\right)^2 \frac{1}{u} \,, \] for $u \in [2,2e]$. Hint: if we want to see if our derived distribution is correct, we can code the pdf in R and integrate over the inferred domain.

integrand <- function(u) {
  return(3*(log(u/2))^2/u)
}
integrate(integrand,2,2*exp(1))$value

## [1] 1

Our answer is 1, as it should be for a properly defined pdf.

2.8 Squaring Standard Normal Random Variables

The square of a standard normal random variable is an important quantity that arises, e.g., in statistical model assessment (through the “sum of squared errors” when the “error” is normally distributed) and in hypothesis tests like the chi-square goodness-of-fit test. But for this quantity to be useful in statistical inference, we need to know its distribution.

In step (1) of the algorithm laid out in the last section, we identify the inverse function: if $U = Z^2$, with $Z \in (-\infty,\infty)$, then $Z = \pm \sqrt{U}$. Then, following step (2), we state that \[ F_U(u) = P(U \leq u) = P(Z^2 \leq u) = P(-\sqrt{u} \leq Z \leq \sqrt{u}) = \Phi(\sqrt{u}) - \Phi(-\sqrt{u}) \,, \] where, as we recall, $\Phi(\cdot)$ is the notation for the standard normal cdf. Because of symmetry, we can simplify this expression: \[ F_U(u) = \Phi(\sqrt{u}) - [1 - \Phi(\sqrt{u})] = 2\Phi(\sqrt{u}) - 1 \,. \]

To carry out step (3), we utilize the chain rule of differentiation to determine that the pdf is \[\begin{align*} f_U(u) = \frac{d}{du} F_U(u) = \frac{d}{du} [2\Phi(\sqrt{u}) - 1] &= 2 \frac{d\Phi}{du}(\sqrt{u}) \cdot \frac{d}{du}\sqrt{u} \\ &= 2 f_Z(\sqrt{u}) \cdot \frac{1}{2\sqrt{u}} \\ &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u}{2}\right) \cdot \frac{1}{\sqrt{u}} \\ &= \frac{u^{-1/2}}{\sqrt{2\pi}} \exp(-\frac{u}{2}) \,, \end{align*}\] with $u \in [0,\infty)$. This is the pdf for a chi-square distribution with one “degree of freedom”: \[ U \sim \chi_1^2 \,, \] with the subscript “1” indicating the number of degrees of freedom. In general, the number of degrees of freedom can be an arbitrary positive integer, and it is conventionally denoted with the Greek letter $\nu$ (nu, pronounced “noo”).

Let’s now look at a sample of $n$ independent standard-normal random variables $\{Z_1,\ldots,Z_n\}$. What is the distribution of $W = \sum_{i=1}^n Z_i^2$? This is a sum of independent random variables, so we can use mgfs to try to answer this question. The mgf for a chi-square random variable with one degree of freedom is \[ m_{Z^2}(t) = (1-2t)^{-\frac{1}{2}} \,, \] and thus the mgf for the sum of independent chi-square-distributed random variables will be \[ m_W(t) = \prod_{i=1}^n m_{Z_i^2}(t) = \prod_{i=1}^n (1-2t)^{-\frac{1}{2}} = (1-2t)^{-\frac{n}{2}} \,. \] We identify $m_W(t)$ as the mgf for a chi-square distribution with $\nu = n$ degrees of freedom. Thus: if we sum chi-square-distributed random variables, the summed quantity is itself chi-square distributed, with $\nu$ being the sum of the numbers of degrees of freedom for the original random variables. (This result can be generalized: if $X_1 \sim \chi_a^2$ and $X_2 \sim \chi_b^2$, then $X_1 + X_2 = W \sim \chi_{a+b}^2$.)

For completeness, we write down the pdf for $\chi_{\nu}^2$: \[ f_X(x) = \frac{x^{\nu/2-1}}{2^{\nu/2} \Gamma(\nu/2)} \exp\left(-\frac{x}{2}\right) \,, \] where $x \geq 0$ and $\nu$ is a positive integer; $E[X] = \nu$ and $V[X] = 2\nu$. (As we will see in Chapter 4, the chi-square family of distributions is a “sub-family” of the more general gamma family of distributions.) The chi-square distribution is highly skew for small values of $\nu$, while chi-square random variables converge in distribution to normal random variables as $\nu \rightarrow \infty$. See Figure 2.7.

$\label{fig:chi2}Chi-square distributions for $\nu = 3$ (left) and $\nu = 30$ (right) degrees of freedom. Chi-square random variables converge in distribution to normal random variables as $\nu \rightarrow \infty$.$ $\label{fig:chi2}Chi-square distributions for $\nu = 3$ (left) and $\nu = 30$ (right) degrees of freedom. Chi-square random variables converge in distribution to normal random variables as $\nu \rightarrow \infty$.$

Figure 2.7: Chi-square distributions for $\nu = 3$ (left) and $\nu = 30$ (right) degrees of freedom. Chi-square random variables converge in distribution to normal random variables as $\nu \rightarrow \infty$.

2.8.1 The Expected Value of a Chi-Square Random Variable

The expected value of a chi-square distribution for one degree of freedom is \[\begin{align*} E[X] &= \int_0^\infty x f_X(x) dx \\ &= \int_0^\infty \frac{x^{1/2}}{\sqrt{2\pi}}\exp\left(-\frac{x}{2}\right) dx \,. \end{align*}\] To find the value of this integral, we are going to utilize the gamma function \[ \Gamma(t) = \int_0^\infty u^{t-1} \exp(-u) du \,, \] which we first introduced in Chapter 1. (Note that $t > 0$.) Recall that if $t$ is an integer, then $\Gamma(t) = (t-1)! = (t-1) \times (t-2) \times \cdots \times 1$, with the exclamation point representing the factorial function. If $t$ is a half-integer and small, one can look up the value of $\Gamma(t)$ online.

The integral we are trying to compute doesn’t quite match the form of the gamma function integral…but as one should recognize by now, we can attempt a variable substitution to change $e^{-x/2}$ to $e^{-u}$: \[ u = x/2 ~,~ du = dx/2 ~,~ x = 0 \implies u = 0 ~,~ x = \infty \implies u = \infty \] We thus rewrite our expected value as \[\begin{align*} E[X] &= \int_0^\infty \frac{\sqrt{2}u^{1/2}}{\sqrt{2\pi}} \exp(-u) (2 du) = \frac{2}{\sqrt{\pi}} \int_0^\infty u^{1/2} \exp(-u) du \\ &= \frac{2}{\sqrt{\pi}} \Gamma\left(\frac{3}{2}\right) = \frac{2}{\sqrt{\pi}} \frac{\sqrt{\pi}}{2} = 1 \,. \end{align*}\] As we saw above, the sum of $n$ chi-square random variables, each distributed with 1 degree of freedom, is itself chi-square-distributed for $n$ degrees of freedom. Hence, in general, if $W \sim \chi_{n}^2$, then $E[W] = n$.

2.8.2 Computing Probabilities

Let’s assume at first that we have a single random variable $Z \sim \mathcal{N}(0,1)$.

What is $P(Z^2 > 1)$?

We know that $Z^2$ is sampled from a chi-square distribution for 1 degree of freedom. So \[ P(Z^2 > 1) = 1 - P(Z^2 \leq 1) = 1 - F_{W(1)}(1) \,. \] This is not easily computed by hand, so we utilize R:

1 - pchisq(1,1)

## [1] 0.3173105

The probability is 0.3173. With the benefit of hindsight, we can see why this was going to be the value all along: we know that $P(-1 \leq Z \leq 1) = 0.6827$, and thus $P(\vert Z \vert > 1) = 1-0.6827 = 0.3173$. If we square both sides in this last probability statement, we see that $P(\vert Z \vert > 1) = P(Z^2 > 1)$.

What is the value $a$ such that $P(W > a) = 0.9$, where $W = Z_1^2 + \cdots + Z_4^2$ and $\{Z_1,\ldots,Z_4\}$ are iid standard normal random variables?

First, we recognize that $W \sim \chi_4^2$, i.e., $W$ is chi-square distributed for 4 degrees of freedom.

Second, we re-express $P(W > a)$ in terms of the cdf of the chi-square distribution, \[ P(W > a) = 1 - P(W \leq a) = 1 - F_{W(4)}(a) = 0.9 \,, \] and we rearrange terms: \[ F_{W(4)}(a) = 0.1 \,. \] To isolate $a$, we use the inverse CDF function: \[ F_{W(4)}^{-1} [F_{W(4)}(a)] = a = F_{W(4)}^{-1}(0.1) \,. \] To compute $a$, we use the R function qchisq():

(a = qchisq(0.1,4))

## [1] 1.063623

We find that $a = 1.064$.

2.9 Sample Variance of Normal Random Variables

Recall the definition of the sample variance: \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \] The reason why we divide by $n-1$ instead of $n$ is that it makes $S^2$ an unbiased estimator of $\sigma^2$. We will illustrate this point below when we cover point estimation.

Above, we showed how the mean for a sample of independent and identically distributed normal random variables is itself normal, with mean $\mu$ and variance $\sigma^2/n$. What, on the other hand, is the distribution of $S^2$?

Let’s suppose that we have $n$ iid normal random variables. We can write down that \[ W = \sum_{i=1}^n \left( \frac{X_i-\mu}{\sigma} \right)^2 \,, \] and we know from results above that $W \sim \chi_n^2$. We can work with the expression to the right of the equals sign now to determine the distribution not of $S^2$ itself, but of $(n-1)S^2/\sigma^2$: \[\begin{align*} \sum_{i=1}^n \left( \frac{X_i-\mu}{\sigma} \right)^2 &= \sum_{i=1}^n \left( \frac{(X_i-\bar{X})+(\bar{X}-\mu)}{\sigma} \right)^2 \\ &= \sum_{i=1}^n \left( \frac{X_i-\bar{X}}{\sigma}\right)^2 + \sum_{i=1}^n \left( \frac{\bar{X}-\mu}{\sigma}\right)^2 + \mbox{cross~term~equaling~zero} \\ &= \frac{(n-1)S^2}{\sigma^2} + n\left(\frac{\bar{X}-\mu}{\sigma}\right)^2 = \frac{(n-1)S^2}{\sigma^2} + \left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\right)^2 \,. \end{align*}\] The expression farthest to the right, \[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \,, \] is the standardization of $\bar{X} \sim \mathcal{N}(\mu,\sigma^2/n)$, and thus it is standard-normal distributed, and thus $Z^2 \sim \chi_1^2$. By utilizing the general result about the sum of chi-square-distributed random variables given at the end of the last section, we can immediately see that \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2 \,. \] We note that if we want to determine the distribution of $S^2$ itself, all we would need to do is perform a general transformation, following the steps outlined earlier.

2.9.1 Computing Probabilities

Let’s suppose we sample $16$ iid normal random variables, and we know that $\sigma^2 = 10$. What is the probability $P(S^2 > 12)$?

(We’ll stop for a moment to answer a question that might occur to the reader. “Why are we doing a problem where we assume $\sigma^2$ is known? In real life, it almost certainly won’t be.” This is an entirely fair question. This example is contrived, but it builds towards a particular situation where we can assume a value for $\sigma^2$: hypothesis testing. The calculation we will do below is analogous to calculations we will do later when testing hypotheses about normal population variances.)

The first question that we should always ask ourselves is whether we know the distribution of the quantity in the probability statement, in this case $S^2$. The answer is no. (As noted above, we can derive it, but we have not explicitly done so.) However, we do know the distribution of $(n-1)S^2/\sigma^2$. Hence: \[\begin{align*} P(S^2 > 12) &= P\left(\frac{(n-1)S^2}{\sigma^2} > \frac{(n-1) \cdot 12}{\sigma^2}\right) \\ &= P\left(W > \frac{15 \cdot 12}{10}\right) = P(W > 18) = 1 - P(W \leq 18) = 1 - F_{W(15)}(18) \,, \end{align*}\] where $W \sim \chi_{n-1}^2$ and $n-1 = 15$. To compute the probability, we use pchisq():

1 - pchisq(18,15)

## [1] 0.2626656

The probability is 0.263: there is only a 26.3% chance that we will sample a value of $S^2$ greater than 12 when the true value is 10.

To estimate the probability via simulation, we can

repeatedly generate data samples of size $n = 16$ from a normal distribution with some arbitrary mean (the value doesn’t matter) and variance $\sigma^2 = 10$;

compute and record $S^2$; and

determine the proportion of our simulated $S^2$ values that are greater than 12.

Note that we record $S^2$ and not $(n-1)S^2/\sigma^2$; as we are performing a simulation, we need not know the distribution of $S^2$ (because…we are simulating it!). All we need to know is the distribution from which the individual data are sampled. Also note that because we are not simulating an infinite number of data, the proportion that we observe will itself be a random variable with some mean, some variance, and some sampling distribution. The key here is “some variance”: to generate a more accurate accounting of the proportion of $S^2$ values greater than 12, we want to sample as much data as we can (a) without having to wait too long for the result, and (b) without causing memory allocation issues by recording too many values of $S^2$, if we choose to record them all.

set.seed(101)       # set so that the same data are generated every time
m      <- 100000    # the number of data samples
n      <- 16        # the size of each data sample
sigma2 <- 10        # the true variance
s2     <- rep(NA,m) # allocated space for S^2
for ( ii in 1:m ) {
  x      <- rnorm(n,mean=0,sd=sqrt(sigma2))
  s2[ii] <- var(x)
}
round(sum(s2>12)/m,3)

## [1] 0.262

Another way to code this to circumvent memory allocation is

set.seed(101)       # set so that the same data are generated every time
m      <- 100000    # the number of data samples
n      <- 16        # the size of each data sample
sigma2 <- 10        # the true variance
s2true <-  0        # a counter of the number of S^2 values > 12
for ( ii in 1:m ) {
  x <- rnorm(n,mean=0,sd=sqrt(sigma2))
  if ( var(x) > 12) s2true = s2true+1
}
round(s2true/m,3)

## [1] 0.262

The tradeoff is that while this code uses less memory, it takes (slightly) longer to run.

2.9.2 Expected Value of the Sample Variance and Standard Deviation

It is extremely straightforward to determine the expected value for the sample variance: \[ E\left[ \frac{(n-1)S^2}{\sigma^2} \right] = n-1 ~~~ \Rightarrow ~~~ E[S^2] = \frac{(n-1)\sigma^2}{(n-1)} = \sigma^2 \,. \] Here, we use the fact that $W = (n-1)S^2/\sigma^2$ is a chi-square-distributed random variable, and that $E[W]$ equals the number of degrees of freedom, which here is $n-1$.

What is more difficult to determine is $E[(S^2)^a]$, where $a$ is some constant. For instance, $E[S^4]$ is not $\sigma^4$ (making the variance of $S^2$ somewhat more difficult to compute than we might initiall expect)…and $E[S]$ is not $\sigma$. Let’s determine $E[S]$ here.

The way we do this is by performing a random variable transformation ($S = \sqrt{S^2}$) to determine the pdf $f_S(s)$, and then computing $E[S] = \int s f_S(s) ds$.

We start by writing \[\begin{align*} P(S \leq s) &= P(\sqrt{S^2} \leq s) = P(S^2 \leq s^2) = P\left( \frac{(n-1)S^2}{\sigma^2} \leq \frac{(n-1)s^2}{\sigma^2} \right) \\ &= P\left( W \leq \frac{(n-1)s^2}{\sigma^2} \right) = F_{W(n-1)}\left( \frac{(n-1)s^2}{\sigma^2} \right) \,. \end{align*}\] We can then use the chain rule of differentiation to write that \[ f_S(s) = \frac{d}{ds}F_{W(n-1)}\left( \frac{(n-1)s^2}{\sigma^2} \right) = f_{W(n-1)}F_{W(n-1)}\left( \frac{(n-1)s^2}{\sigma^2} \right) \frac{2(n-1)s}{\sigma^2} \,. \] We then substitute in the pdf: \[ f_S(s) = \frac{2(n-1)s}{\sigma^2} \frac{[(n-1)s^2/\sigma^2]^{(n-3)/2}}{2^{(n-1)/2}\Gamma((n-1)/2)} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right) \,. \] The expected value of $S$ is then \[ E[S] = \int_0^\infty s f_S(s) ds = \int_0^\infty \frac{2(n-1)s^2}{\sigma^2} \frac{[(n-1)s^2/\sigma^2]^{(n-3)/2}}{2^{(n-1)/2}\Gamma((n-1)/2)} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right) ds \,. \] OK…how should we pursue this? Let’s try a variable substitution approach, with \[ x = \frac{(n-1)s^2}{2\sigma^2} ~~~ \Rightarrow ~~~ dx = \frac{(n-1)s}{\sigma^2}ds = \frac{n-1}{\sigma^2} \left(\frac{2 \sigma^2 x}{n-1}\right)^{1/2} ds = \left(\frac{ 2 (n-1) x }{\sigma^2} \right)^{1/2} ds \,. \] We note that if $s = 0$, $x = 0$, and if $s = \infty$, $x = \infty$, so the integral bounds are unchanged. So now we have that \[\begin{align*} E[S] &= \int_0^\infty 4x \frac{(2x)^{(n-3)/2}}{2^{(n-1)/2}\Gamma((n-1)/2)} \exp(-x) \frac{\sigma}{\sqrt{2(n-1)x}} dx \\ &= \sqrt{2} \int_0^\infty \frac{x^{(n-2)/2}}{\Gamma((n-1)/2)} \exp(-x) \frac{\sigma}{\sqrt{n-1}} dx \\ &= \sqrt{2} \frac{\sigma}{\sqrt{n-1}} \frac{1}{\Gamma((n-1)/2)} \int_0^\infty x^{n/2-1} \exp(-x) dx \\ &= \sigma \sqrt{\frac{2}{n-1}} \frac{\Gamma(n/2)}{\Gamma((n-1)/2)} \,. \end{align*}\] The integral in the second-to-last line is that which defines the gamma function $\Gamma(\cdot)$. So we see that while $S^2$ is an unbiased estimator of $\sigma^2$, $S$ is a biased estimator of $\sigma$. We note, without getting into details, that as $n \rightarrow \infty$, $E[S] \rightarrow \sigma$, so $S$ is asymptotically unbiased.

2.10 Standardizing a Normal Random Variable with Unknown Variance

It is generally the case in real-life that when we assume our data are sampled from a normal distribution, both the mean and the variance are unknown. This means that instead of this \[ Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) \,, \] we have \[ \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim \mbox{?} \,. \] This last expression contains two random variables, one in the numerator and one in the denominator, and each is independent of the other. (We state this without proof…but we will appeal to intuition. For many distributions, the expected value and variance are both functions of one or more parameters $\theta$ [e.g., the exponential, binomial, and Poisson distributions, etc., etc.]. However, for a normal distribution, the expected value only depends on $\mu$ and the variance only depends on $\sigma^2$.) How would we determine the distribution of the ratio above?

There is no unique way by which to approach the derivation of the distribution: we simply illustrate one way of doing so. First, we rewrite the ratio above as \[ T = \frac{\bar{X}-\mu}{S/\sqrt{n}} = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} / \frac{S}{\sigma} = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} / \sqrt{\frac{S^2}{\sigma^2}} = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} / \left( \sqrt{\frac{\nu S^2}{\sigma^2}} \frac{1}{\sqrt{\nu}} \right) = Z / \left(\sqrt{\frac{W_{\nu}}{\nu}}\right) \,, \] where $Z \sim \mathcal{N}(0,1)$ and $W_{\nu}$ is sampled from a chi-square distribution with $\nu = n-1$ degrees of freedom. Our next step is to determine the distribution of the random variable $U = \sqrt{W_{\nu}/\nu}$. We identify that $V = \sqrt{W_{\nu}}$ is sampled from a chi distribution (note: no square!), whose pdf may be derived via a general transformation (since $V = \sqrt{\nu}S/\sigma$ and since we have written down $f_S(s)$ above), but can also simply be looked up: \[ f_V(v) = \frac{v^{\nu-1} \exp(-v^2/2)}{2^{\nu/2-1}\Gamma(\nu/2)} \,, \] for $v \geq 0$ and $\nu > 0$. However, we do still have to apply a general transformation to determine the pdf for $U = V/\sqrt{\nu}$. Doing so, we find that \[ f_U(u) = f_V(\sqrt{\nu}u) \sqrt{\nu} = \frac{\sqrt{\nu} (\sqrt{\nu}u)^{\nu-1} \exp(-(\sqrt{\nu}u)^2/2)}{2^{\nu/2-1}\Gamma(\nu/2)} \,. \] (So, yes, we could have derived $f_U(u)$ directly from $f_S(s)$ via a general transformation…but here we at least indicate to the reader that there is such a thing as the chi distribution.)

At this point, we know the distributions for both the numerator and the denominator. Now we write down a general result dubbed the ratio distribution. For $T = Z/U$, where $Z$ and $U$ are independent variables, that distribution is \[ f_T(t) = \int_{-\infty}^\infty \vert u \vert f_Z(tu) f_{U}(u) du \rightarrow \int_0^\infty u f_Z(tu) f_{U}(u) du \,, \] where we make use of the fact that $u > 0$. Thus \[\begin{align*} f_T(t) &= \int_0^\infty u \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2u^2}{2}\right) \frac{\sqrt{\nu} (\sqrt{\nu}u)^{\nu-1} \exp(-(\sqrt{\nu}u)^2/2)}{2^{\nu/2-1}\Gamma(\nu/2)} du \\ &= \frac{1}{\sqrt{2\pi}} \frac{1}{2^{\nu/2-1}\Gamma(\nu/2)} \nu^{\nu/2} \int_0^\infty u^\nu \exp\left(-\frac{(t^2+\nu)u^2}{2}\right) du \,. \end{align*}\] Given the form of the integral and the fact that it is evaluated from zero to infinity, we will use a variable substitution approach and eventually turn the integral into a gamma function: \[ x = \frac{(t^2+\nu)u^2}{2} ~~~ \Rightarrow ~~~ dx = (t^2+\nu)~u~du \,. \] When $u = 0$, $x = 0$, and when $u = \infty$, $x = \infty$, so the integral bounds are unchanged. Hence we can rewrite the integral above as \[\begin{align*} f_T(t) &= \frac{1}{\sqrt{2\pi}} \frac{1}{2^{\nu/2-1}\Gamma(\nu/2)} \nu^{\nu/2} \int_0^\infty \left(\sqrt{\frac{2x}{(t^2+\nu)}}\right)^\nu \exp\left(-x\right) \frac{dx}{\sqrt{2 x (t^2+\nu)}} \\ &= \frac{1}{\sqrt{2\pi}} \frac{1}{2^{\nu/2-1}\Gamma(\nu/2)} \nu^{\nu/2} 2^{\nu/2} 2^{-1/2} \frac{1}{(t^2+\nu)^{(\nu+1)/2}} \int_0^\infty \frac{x^{\nu/2}}{x^{1/2}} \exp\left(-x\right) dx \\ &= \frac{1}{\sqrt{\pi}} \frac{1}{\Gamma(\nu/2)} \nu^{\nu/2} \frac{1}{(t^2+\nu)^{(\nu+1)/2}} \int_0^\infty x^{\nu/2-1/2} \exp\left(-x\right) dx \\ &= \frac{1}{\sqrt{\pi}} \frac{1}{\Gamma(\nu/2)} \nu^{\nu/2} \frac{1}{(t^2+\nu)^{(\nu+1)/2}} \Gamma((\nu+1)/2) \\ &= \frac{1}{\sqrt{\pi}} \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)} \nu^{\nu/2} \frac{1}{\nu^{(\nu+1)/2}(t^2/\nu+1)^{(\nu+1)/2}} \\ &= \frac{1}{\sqrt{\nu \pi}} \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)} \left(1+\frac{t^2}{\nu}\right)^{-(\nu+1)/2} \,. \end{align*}\] $T$ is sampled from a Student’s t distribution for $\nu$ degrees of freedom. Assuming that $\nu$ is integer-valued, the expected value of $T$ is $E[T] = 0$ for $\nu \geq 2$ (and is undefined otherwise), while the variance is $\nu/(\nu-2)$ for $\nu \geq 3$, is infinite for $\nu=2$, and is otherwise undefined. Appealing to intuition, we can form a (symmetric) $t$ distribution by taking a standard normal $\mathcal{N}(0,1)$ and “pushing down” in the center, the act of which transfers probability density equally to both the lower and upper tails. In Figure 2.8, we see that the smaller the number of degrees of freedom, the more density is transferred to the tails, and that as $\nu$ increases, the more and more $f_{T(\nu)}(t)$ becomes indistinguishable from a standard normal. (The more technical way of stating this is that the random variable $T$ converges in distribution to a standard normal random variable as $\nu \rightarrow \infty$.)

$\label{fig:t}The natural logarithm of the pdf for the standard normal (black) and for $t$ distributions with 3 (red dashed), 6 (green dotted), and 12 (blue dot-dashed) degrees of freedom. We observe that as $n$ decreases, there is more probability density in the lower and upper tails of the distributions.$

Figure 2.8: The natural logarithm of the pdf for the standard normal (black) and for $t$ distributions with 3 (red dashed), 6 (green dotted), and 12 (blue dot-dashed) degrees of freedom. We observe that as $n$ decreases, there is more probability density in the lower and upper tails of the distributions.

2.10.1 Computing Probabilities

In a study, the diameters of 8 widgets are measured. It is known from previous work that the widget diamaters are normally distributed, with sample standard deviation $S = 1$ unit. What is the probability that the the sample mean $\bar{X}$ observed in the current study will be within one unit of the population mean $\mu$?

The probability we seek is \[ P( \vert \bar{X} - \mu \vert < 1 ) = P(\mu - 1 \leq \bar{X} \leq \mu + 1) \,. \] The key here is that we don’t know $\sigma$, so the probability can only be determined if we manipulate this expression such that the random variable inside it is $t$-distributed…and $\bar{X}$ itself is not. So the first step is to standardize: \[ P\left( \frac{\mu - 1 - \mu}{S/\sqrt{n}} \leq \frac{\bar{X} - \mu}{S/\sqrt{n}} \leq \frac{\mu + 1 - \mu}{S/\sqrt{n}}\right) = P\left( -\frac{\sqrt{n}}{S} \leq T \leq \frac{\sqrt{n}}{S}\right) \,. \] We know that $\sqrt{n}(\bar{X} - \mu)/S$ is $t$-distributed (with $\nu = n-1$ degrees of freedom), so long as the individual data are normal iid random variables. (If the individual data are not normally distributed, this question might have to be solved via simulations…if we cannot determine the distribution of $\bar{X}$ using, e.g., moment-generating functions.)

The probability is the difference between two cdf values: \[ P\left( -\frac{\sqrt{n}}{S} \leq T \leq \frac{\sqrt{n}}{S}\right) = F_{T(7)}(\sqrt{8}) - F_{T(7)}(-\sqrt{8}) \,, \] where $n-1$, the number of degrees of freedom, is 7. This is the end of the problem if we are using pen and paper. If we have R at our disposal, we would code the following:

pt(sqrt(8),7) - pt(-sqrt(8),7)

## [1] 0.9745364

The probability that $\bar{X}$ will be within one unit of $\mu$ is 0.975.

2.11 Point Estimation

In the previous chapter, we introduced the concept of point estimation, using statistics to make inferences about a population parameter $\theta$. Recall that a point estimator $\hat{\theta}$, being a statistic, is a random variable and has a sampling distribution. One can define point estimators arbitrarily, but in the end, we can choose the best among a set of estimators by assessing properties of their sampling distributions:

Bias: $B[\hat{\theta}] = E[\hat{\theta}-\theta]$
Variance: $V[\hat{\theta}]$

Recall: the bias of an estimator is the difference between the average value of the estimates it generates and the true parameter value. If $E[\hat{\theta}-\theta] = 0$, then the estimator $\hat{\theta}$ is said to be unbiased.

Assuming that our observed data are independent and identically distributed (or iid), then computing each of these quantities is straightforward. Choosing a best estimator based on these two quantities might not lead to a clear answer, but we can overcome that obstacle by combining them into one metric, the mean-squared error (or MSE):

MSE: $MSE[\hat{\theta}] = B[\hat{\theta}]^2 + V[\hat{\theta}]$

In the previous chapter, we also discussed how guessing the form of an estimator is sub-optimal, and how there are various approaches to deriving good estimators. The first one that we highlight is maximum likelihood estimation (or MLE). We will apply the MLE here to derive an estimator for the normal population mean. Then we will introduce methodology by which to assess the estimator in an absolute sense, and extend these results to write down the asymptotic sampling distribution for the MLE, i.e., the sampling distribution as the sample size $n \rightarrow \infty$.

Recall: the value of $\theta$ that maximizes the likelihood function is the maximum likelihood estimate, or MLE, for $\theta$. The maximum is found by taking the (partial) derivative of the (log-)likelihood function with respect to $\theta$, setting the result to zero, and solving for $\theta$. That solution is the maximum likelihood estimate $\hat{\theta}_{MLE}$. Also recall the invariance property of the MLE: if $\hat{\theta}_{MLE}$ is the MLE for $\theta$, then $g(\hat{\theta}_{MLE})$ is the MLE for $g(\theta)$.

The setting, again, is that we have randomly sampled $n$ data from a normal distribution with parameters $\mu$ and $\sigma$. This means that the likelihood function is \[\begin{align*} \mathcal{L}(\mu,\sigma \vert \mathbf{x}) &= \prod_{i=1}^n f_X(x \vert \mu,\sigma) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \\ &= \frac{1}{(2 \pi)^{n/2}} \frac{1}{\sigma^n} \exp\left({-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2}\right) \,, \end{align*}\] and the log-likelihood is \[ \ell(\mu,\sigma \vert \mathbf{x}) = -\frac{n}{2} \log (2 \pi) - n \log \sigma - \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \,. \] Because there is more than one parameter, we take the partial derivative of $\ell$ with respect to $\mu$: \[\begin{align*} \ell'(\mu,\sigma \vert \mathbf{x}) &= \frac{\partial}{\partial \mu} \left( -\frac{n}{2} \log (2 \pi) - n \log \sigma - \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) \\ &= -\frac{1}{\sigma^2} \sum_{i=1}^n 2(x_i - \mu)(-1) \\ &= \frac{2}{\sigma^2} \sum_{i=1}^n (x_i - \mu) \,. \end{align*}\] After setting this quantity to zero and dividing out the term $2/\sigma^2$, we find that \[ \sum_{i=1}^n x_i = \sum_{i=1}^n \mu ~~\Rightarrow~~ \sum_{i=1}^n x_i = n \mu ~~\Rightarrow~~ \frac{1}{n} \sum_{i=1}^n x_i = n \mu ~~\Rightarrow~~ \mu = \bar{x} \,, \] and thus $\hat{\mu}_{MLE} = \bar{X}$. (Recall that as we go from a purely mathematical derivation to a final definition of an estimator, we need to take into account that the estimator is a random variable and thus we need to change from lower-case to upper-case when writing out the variable. Also recall that for this MLE to be valid, the second derivative of the log-likelihood function $\ell(\theta \vert \mathbf{x})$ at $x = \bar{x}$ has to be negative. We leave checking this as an exercise to the reader.)

Because the estimator is $\bar{X}$, we know immediately its expected value and variance given previous results:

$E[\hat{\mu}_{MLE}] = E[\bar{X}] = \mu$ (the estimator is unbiased); and
$V[\hat{\mu}_{MLE}] = V[\bar{X}] = \sigma^2/n$ (and thus $MSE[\hat{\mu}_{MLE}] = \sigma^2/n$).

While $\hat{\mu}_{MLE}$ is an unbiased estimator, recall that MLEs can be biased…but if they are, they are always asymptotically unbiased, meaning that the bias goes away as $n \rightarrow \infty$.

Here, we will introduce one more estimator concept, that of consistency: do both the bias and the variance of the estimator go to zero as the sample size $n$ increases? If so, our estimator is said to be consistent. If an estimator is not consistent, it will never converge to the true value $\theta$, regardless of sample size. Here, we need not worry about the bias ($\hat{\mu}_{MLE} = \bar{X}$ in unbiased for all $n$), and we note that \[ \lim_{n \rightarrow \infty} \frac{\sigma^2}{n} = 0 \,. \] Thus $\hat{\mu}_{MLE}$ is a consistent estimator.

The question that we are going to pose now refers to a point made obliquely in the previous chapter: how do we assess $V[\hat{\mu}_{MLE}]$ in absolute terms? Can we come up with an estimator that is more accurate for any given value of $n$?

The answer lies in the Cramer-Rao inequality, which specifies the lower bound on the variance of an estimator (so long as the domain of $p_X(x \vert \theta)$ or $f_X(x \vert \theta)$ does not depend on $\theta$ itself…which is true for the normal). In this work, we focus on determining lower bounds for unbiased estimators, but bounds can be found for biased estimators as well, as shown here.

If we sample $n$ iid data, then the Cramer-Rao lower bound (CRLB) on the variance of an unbiased estimator $\hat{\theta}$ is \[ V[\hat{\theta}] = \frac{1}{I_n(\theta)} = \frac{1}{nI(\theta)} \,, \] where $I(\theta)$ represents the Fisher information for $\theta$ contained in a single random variable $X$. Assuming we sample our data from a pdf, the Fisher information is \[ I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f_X(x \vert \theta) \right)^2\right] ~~~\mbox{or}~~~ I(\theta) = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(x \vert \theta) \right] \,. \] (The equation to the right is valid only under certain regularity conditions, but those conditions are usually met in the context of this book and thus this is the equation that we will use in general, as it makes for more straightforward computation.) Let’s step back for a second and think about what the Fisher information represents. It is the average value of the square of the first derivative of a pdf. If a pdf’s slope is relatively small (because the pdf is wide, like a normal with a large $\sigma$ parameter), then the average value of the slope, squared, is relatively small, i.e., the information that $X$ provides about $\mu$ or $\sigma$ is relatively small. This makes intuitive sense: a wide pdf means that, for instance, when we sample data and try to subsequently estimate $\theta$, we will be more uncertain about its actual value. On the other hand, if the pdf is highly concentrated, then the average slope of the pdf is larger and…$I(\theta)$ is relatively large; a sampled datum would contain more information by which to constrain $\theta$. (See Figure 2.9, in which the pdf to the left has less Fisher information content about $\mu$ than the pdf to the right.)

$\label{fig:fisher}An illustration of Fisher information. For the pdf to the left, the slope changes slowly, and thus the rate of change of the slope is smaller than for the pdf to the right. Thus the pdf to the left contains less information about the population mean, i.e., its Fisher information value $I(\mu)$ is smaller.$ $\label{fig:fisher}An illustration of Fisher information. For the pdf to the left, the slope changes slowly, and thus the rate of change of the slope is smaller than for the pdf to the right. Thus the pdf to the left contains less information about the population mean, i.e., its Fisher information value $I(\mu)$ is smaller.$

Figure 2.9: An illustration of Fisher information. For the pdf to the left, the slope changes slowly, and thus the rate of change of the slope is smaller than for the pdf to the right. Thus the pdf to the left contains less information about the population mean, i.e., its Fisher information value $I(\mu)$ is smaller.

For the normal, \[\begin{align*} \log f_X(X \vert \mu) &= -\frac{1}{2} \log (2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2} \\ \frac{\partial}{\partial \mu} \log f_X(X \vert \mu) &= 0 - \frac{2(x-\mu)(-1)}{2\sigma^2} = \frac{(x-\mu)}{\sigma^2} \\ \frac{\partial^2}{\partial \mu^2} \log f_X(X \vert \mu) &= -\frac{1}{\sigma^2} \,, \end{align*}\] so \[ I(\mu) = -E\left[-\frac{1}{\sigma^2}\right] = \frac{1}{\sigma^2} \,, \] and \[ V[\hat{\mu}] = \frac{1}{n (1/\sigma^2)} = \frac{\sigma^2}{n} \,. \] We see that the variance of $\hat{\mu} = \bar{X}$ achieves the CRLB, meaning that indeed we cannot propose a better unbiased estimator for the normal population mean.

We conclude this section by using the Fisher information to state an important result about maximum likelihood estimates. Recall that an estimator is a statistic, and thus it is a random variable with a sampling distribution. What do we know about this distribution? For arbitrary values of $n$, nothing, per se; we would have to run simulations to derive an empirical sampling distribution. However, as $n \rightarrow \infty$, the MLE converges in distribution to a normal random variable: \[ \sqrt{n}(\hat{\theta}_{MLE}-\theta) \stackrel{d}{\rightarrow} Y \sim \mathcal{N}\left(0,\frac{1}{I(\theta)}\right) ~\mbox{or}~ (\hat{\theta}_{MLE}-\theta) \stackrel{d}{\rightarrow} Y' \sim \mathcal{N}\left(0,\frac{1}{nI(\theta)}\right) \,. \] (As usual, some regularity conditions apply that do not concern us at the current time.) We already knew about the mean zero part: we had stated that the MLE is asymptotically unbiased. What’s new is the variance, and the fact that the variance achieves the CRLB, and the identification of the sampling distribution as being normal. (It turns out that the normality of the distribution is related to the Central Limit Theorem, the subject of the next section, and thus this is ultimately not a surprising result.) A sketch of how one would prove the asymptotic normality of the MLE is provided as optional material in Chapter 7.

2.11.1 Maximum Likelihood Estimation for Normal Variance

As above, we will suppose that we are given a data sample $\{X_1,\ldots,X_n\} \stackrel{iid}{\sim} \mathcal{N}(\mu,\sigma^2)$, but here, our desire is to estimate the population variance $\sigma^2$ instead of the population mean. Borrowing the form of the log-likelihood $\ell(\mu,\sigma^2 \vert \mathbf{x})$ from above, we find that \[\begin{align*} \frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = 0 \\ ~\implies~ \hat{\sigma^2}_{MLE} &= \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 = \hat{\sigma^2}_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2 \,, \end{align*}\] where the last equality follows from the fact that $\hat{\mu}_{MLE} = \bar{X}$. We can see immediately that \[ \hat{\sigma^2}_{MLE} = \frac{n-1}{n}S^2 \,. \] Thus \[ E\left[\hat{\sigma^2}_{MLE}\right] = E\left[\frac{n-1}{n}S^2\right] = \frac{n-1}{n}\sigma^2 \,. \] At this point, one might say, “wait a minute…how do we know $E[S^2] = \sigma^2$? We know this because \[ E[S^2] = \frac{\sigma^2}{n-1} E\left[\frac{(n-1)S^2}{\sigma^2}\right] = \frac{\sigma^2}{n-1} E[W] = \frac{\sigma^2}{n-1} (n-1) = \sigma^2 \,, \] where $W$ is a chi-square-distributed random variable for $n-1$ degrees of freedom; $E[W] = n-1$. Now, to get back to the narrative at hand: the MLE for $\sigma^2$ is a biased estimator, but it is asymptotically unbiased: \[ \lim_{n \rightarrow \infty} \left( E\left[\hat{\sigma^2}_{MLE}\right] - \sigma^2 \right) = \lim_{n \rightarrow \infty} \left( \frac{n-1}{n}\sigma^2 - \sigma^2 \right) = 0 \,. \]

2.11.2 Asymptotic Normality of the MLE for the Normal Population Variance

We will derive the Fisher information and use it to to write down the Cramer-Rao lower bound for the variances of estimates of the normal population variance.

We have that \[\begin{align*} \log f_X(X \vert \sigma^2) &= -\frac{1}{2} \log (2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2} \\ \frac{\partial}{\partial \sigma^2} \log f_X(X \vert \sigma^2) &= -\frac{1}{2}\frac{1}{2 \pi \sigma^2} 2 \pi + \frac{(x-\mu)^2}{2(\sigma^2)^2} = -\frac{1}{2 \sigma^2} + \frac{(x-\mu)^2}{2(\sigma^2)^2} \\ \frac{\partial^2}{\partial (\sigma^2)^2} \log f_X(X \vert \sigma^2) &= \frac{1}{2 (\sigma^2)^2} - \frac{2(x-\mu)^2}{2(\sigma^2)^3} = \frac{1}{2(\sigma^2)^2} - \frac{(x-\mu)^2}{(\sigma^2)^3} \,, \end{align*}\] The expected value is \[ E\left[ \frac{1}{2(\sigma^2)^2} - \frac{(x-\mu)^2}{(\sigma^2)^3} \right] = \frac{1}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3} E[(x-\mu)^2] = \frac{1}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3} \sigma^2 = -\frac{1}{2\sigma^4} \,, \] and thus $I(\sigma^2) = -E[-1/(2\sigma^4)] = 1/(2\sigma^4)$, $I_n(\sigma^2) = n/(2\sigma^4)$, and \[ \hat{\sigma^2} \stackrel{d}{\rightarrow} Y \sim \mathcal{N}\left(\sigma^2,\frac{2\sigma^4}{n}\right) \,. \]

2.11.3 Simulating the Sampling Distribution of the MLE for the Normal Population Variance

The result that we derive immediately above is an asymptotic result…but we never actually have an infinite sample size. If our sample size is low, we should expect that the distribution of $\hat{\sigma^2}$ will deviate substantially from the asymptotic expectation, but we cannot know by how much unless we run simulations.

Below, we show how we might run a simulation if we assume that we have $n = 15$ data drawn from a $\mathcal{N}(0,4)$ distribution. See Figure 2.10.

set.seed(101)
n          <- 15
sigma2     <- 4
k          <- 1000
X          <- matrix(rnorm(n*k,sd=sqrt(sigma2)),nrow=k)
sigma2.hat <- (n-1)*apply(X,1,var)/n
cat("The sample mean for sigma2.hat               = ",mean(sigma2.hat),"\n")

## The sample mean for sigma2.hat               =  3.679903

cat("The sample standard deviation for sigma2.hat = ",sd(sigma2.hat),"\n")

## The sample standard deviation for sigma2.hat =  1.362853

empirical.dist <- data.frame(sigma2.hat=sigma2.hat)
x              <- seq(0.1,3*sigma2,by=0.1)
y              <- dnorm(x,mean=sigma2,sd=sqrt(2)*sigma2/sqrt(n))
asymptotic.pdf <- data.frame(x=x,y=y)
ggplot(data=empirical.dist,aes(x=sigma2.hat,y=after_stat(density))) +
  geom_histogram(fill="blue",col="black",breaks=seq(0,10,by=1)) +
  geom_line(data=asymptotic.pdf,aes(x=x,y=y),col="red",lwd=1) +
  labs(x="Estimate of sigma^2") +
  coord_cartesian(xlim=c(0,10)) +
  base_theme

$\label{fig:sigma2hatmlefig}The empirical pdf for $\hat{\sigma^2}_{MLE}$ given $n = 15$ and true variance $\sigma^2=4$. We overlay the asymptotic pdf in red.$

Figure 2.10: The empirical pdf for $\hat{\sigma^2}_{MLE}$ given $n = 15$ and true variance $\sigma^2=4$. We overlay the asymptotic pdf in red.

We see that if $n$ is small, the distribution of the MLE for $\hat{\sigma^2}$ is definitely not normal, but right skew with a mean smaller than the true mean (4) and standard deviation slightly smaller than the true standard deviation (1.46).

2.12 The Central Limit Theorem

Thus far in this chapter, we have assumed that we have sampled individual data from normal distributions. While we have been able to illustrate many concepts related to probability and statistical inference in this setting, the reader may feel that this is unduly limiting: what if the data we sample are not normally distributed? While we do examine the world beyond normality in future chapters, there is one major concept we can discuss now: the idea that statistics computed using non-normal data can themselves have sampling distributions that are at least approximately normal. This is big: it means that, e.g., we can utilize the same machinery for deriving normal-based confidence intervals and for conducting normal-based hypothesis tests, machinery that we outline below, to generate inferences for these (approximately) normally distributed statistics.

The central limit theorem, or CLT, is one of the most important probability theorems, if not the most important. It states that if we have $n$ iid random variables $\{X_1,\ldots,X_n\}$ with mean $E[X_i] = \mu$ and finite variance $V[X_i] = \sigma^2 < \infty$, and if $n$ is sufficiently large, then $\bar{X}$ is approximately normally distributed: \[ \lim_{n \rightarrow \infty} P\left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \leq z \right) = \Phi(z) \,. \] In other words, \[ {\bar X} \stackrel{d}{\rightarrow} Y \sim \mathcal{N}\left(\mu,\frac{\sigma^2}{n}\right) ~~\mbox{or, alternatively,}~~ X_+ = \sum_{i=1}^n X_i \stackrel{d}{\rightarrow} Y_+ \sim \mathcal{N}\left(n\mu,n\sigma^2\right) \,. \] Note that above we added the caveat “if $n$ is sufficiently large.” How large is sufficiently large? The historical rule of thumb is that if $n \gtrsim 30$, then we may utilize the CLT. However, the true answer is that it depends on the distribution from which the data are sampled, and thus that it never hurts to perform simulations to see if, e.g., fewer (or more!) data are needed in a particular setting.

A proof of the CLT that utilizes moment-generating functions is given in Chapter 7.

We note an apparent limitation of the CLT: here, we say that \[ \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \stackrel{d}{\rightarrow} Z \sim \mathcal{N}(0,1) \,, \] but in reality we rarely, if ever, know $\sigma$. If we use the sample standard deviation $S$ instead, how does that effect our use of the CLT? The answer is some, but not much…effectively, it does not change the sample size rule of thumb, but it does mean that we will need a few more samples to achieve the same level of accuracy that we would achieve if we know $\sigma$. Refer to the proof in Chapter 7. We see that initially, we standardize the random variables (i.e., transform $X$ to $Z = (X-\mu)/\sigma$) and declare that $E[Z] = 0$ and $V[Z] = 1$. The latter equality no longer holds if we use $Z' = (X-\mu)/S$ instead! But $V[Z']$ does converge to 1 as $n \rightarrow \infty$: $S^2 \rightarrow \sigma^2$ by the Law of Large Numbers, and $S \rightarrow \sigma$ by the continuous mapping theorem. So even if we do not know $\sigma$, the CLT will “kick in” eventually.

Another question the reader may have is whether or not it is the case that, since $\sqrt{n}(\bar{X}-\mu)/\sigma$ converges in distribution to a standard normal random variable, it would also be the case that $\sqrt{n}(\bar{X}-\mu)/S$ will converge in distribution to a $t$-distributed random variable. The short answer: no. $t$-distributed data are such because when data are drawn from a normal distribution, $\bar{X}$ is normally distributed and $(n-1)S^2/\sigma^2$ is chi-square distributed, and the ratio of $\bar{X}-\mu$ and $S/\sqrt{n}$ is a $t$-distributed random variable. If the individual data are drawn from an arbitrary distribution, then $(n-1)S^2/\sigma^2$ will deviate, perhaps markedly, from being chi-square distributed, thus we cannot say anything about the distribution of $\sqrt{n}(\bar{X}-\mu)/S$…other than it eventually converges in distribution to a standard normal random variable.

An important final note: if we have data drawn from a known, non-normal distribution and we need to derive the distribution of $\bar{X}$ in order to, e.g., construct confidence intervals or to perform hypothesis tests, we should not just default to utilizing the CLT! We might be able to compute the exact distribution for $\bar{X}$, for all sample sizes $n$, via, e.g., the method of moment-generating functions.

2.12.1 Computing Probabilities

Let’s assume that we have a sample of $n = 64$ iid data drawn from unknown distribution with mean $\mu = 10$ and finite variance $\sigma^2 = 9$. What is the probability that the observed sample mean will be larger than 11?

This is a good example of a canonical CLT exercise: $\mu$ and $\sigma$ are given to us, but the distribution is left unstated…and $n \geq 30$. This is a very unrealistic: when will we know population parameters exactly? But such an exercise has its place: it allows us to practice the computation of probabilities. \[\begin{align*} P\left( \bar{X} > 11 \right) &= 1 - P\left( \bar{X} \leq 11 \right) \\ &= 1 - P\left( \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \leq \frac{\sqrt{n}(11-\mu)}{\sigma}\right) \\ &\approx 1 - P\left( Z \leq \frac{8(11-10)}{3} \right) = 1 - \Phi\left(\frac{8}{3}\right) \,. \end{align*}\] As we know by now, we cannot go any further by hand, so we call upon R:

1 - pnorm(8/3)

## [1] 0.003830381

The probability that $\bar{X}$ is greater than 11 is small: 0.0038.

To reiterate a point made above: if we were given $\mu$ but not $\sigma^2$, we would plug in $S$ instead and expect that the distribution of $\sqrt{n}(\bar{X}-\mu)/S$ would be not as close to a standard normal as the distribution of $\sqrt{n}(\bar{X}-\mu)/\sigma$…it takes $\sqrt{n}(\bar{X}-\mu)/S$ “longer” to converge in distribution to a standard normal as $n$ increases.

2.13 Confidence Intervals

Recall: a confidence interval is a random interval $[\hat{\theta}_L,\hat{\theta}_U]$ that overlaps (or covers) the true value $\theta$ with probability \[ P\left( \hat{\theta}_L \leq \theta \leq \hat{\theta}_U \right) = 1 - \alpha \,, \] where $1 - \alpha$ is the confidence coefficient. We determine $\hat{\theta}$ by solving the following equation: \[ F_Y(y_{\rm obs} \vert \theta) - q = 0 \,, \] where $F_Y(\cdot)$ is the cumulative distribution function for the statistic $Y$, $y_{\rm obs}$ is the observed value of the statistic, and $q$ is an appropriate quantile value that is determined using the confidence interval reference table introduced in section 16 of Chapter 1.

When we construct confidence intervals, we are allowed to utilize any arbitrary statistic, so long as we know the sampling distribution of that statistic (which links the statistic value to the underlying parameter of interest). However, in the context of the normal distribution, there are conventional choices:

$\theta = \mu$, and the variance $\sigma^2$ is known: we use the sample mean $\bar{X}$
$\theta = \mu$, and the variance $\sigma^2$ is unknown: we still use $\bar{X}$
$\theta = \sigma^2$: we use the sample variance $S^2$

These are conventional choices, but we will state here that there are well-established reasons for using these statistics based on the idea of interval length: if we examine the use of two separate statistics for constructing a confidence interval for $\theta$, we generally want to use the one that generates shorter confidence intervals (i.e., the one that indicates the least amount of uncertainty about $\theta$). And $\bar{X}$ and $S^2$ are generally the best statistics to use with normally distributed data.

(We keep saying “generally.” In an academic setting, where there are no outlier data and the data-generating process is clean, $\bar{X}$ and $S^2$ will be the best choices of statistics. In a real-world setting, where data are messy, there may be times where utilizing, e.g., the sample median is warranted. Luckily for us, here we exist in the world of pristine data…the real world can wait.)

Let’s assume that we have collected a sample of $n$ iid data, drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$, and with sample mean and variance $\bar{X}$ and $S^2$, and let’s examine each of the three situations outlined above.

$\theta = \mu$, and the variance $\sigma^2$ is known. We adopt $Y = \bar{X}$, with the observed statistic being $y_{\rm obs} = \bar{x}$. By using the method of moment-generating functions, we have found that \[ Y \sim \mathcal{N}(\mu,\sigma^2/n) \] and thus that \[ F_Y(y) = \frac{1}{2}\left[ 1 + {\rm erf}\left(\frac{\sqrt{n}(y - \mu)}{\sqrt{2}\sigma}\right) \right] \,. \] We can see immediately that we are not solving interval bounds by hand! For instance, \[ \hat{\mu}_U = y_{\rm obs} - \frac{\sigma}{\sqrt{n}}{\rm erf}^{-1}(2q-1) \,, \] where $q = \alpha/2$ and where erf$^{-1}(\cdot)$ is the inverse error function, which is not analytically tractable. Thus we fall back upon numerical methods, utilizing cdf-evaluation functions. In an example below, we provide R code, utilizing pnorm(), for computing confidence interval bounds.
$\theta = \mu$, and the variance $\sigma^2$ is unknown. We adopt $Y = \bar{X}$, with the observed statistic being $y_{\rm obs} = \bar{x}$. Earlier in this chapter, we show that \[ T = \frac{Y-\mu}{S/\sqrt{n}} \sim t(n-1) \,, \] where $t(n-1)$ is the $t$-distribution for $n-1$ degrees of freedom. We immediately sense a problem here: we need to know $F_Y(y)$ for our root-finding algorithm to work; $F_{T(n-1)}(t)$ will not help us! ($T$ subsumes the parameter of interest $\mu$, leaving us unable to solve for $\mu$ when applying the root-finding algorithm.) However, we can determine $F_Y(y)$ through a random-variable transformation. Let $Y = aT+b$, where $a = s_{\rm obs}/\sqrt{n}$ and $b = \mu$. Then \[ F_Y(y) = P(Y \leq y) = P\left(T \leq \frac{y-b}{a}\right) = F_{T,n-1}\left(\frac{y-b}{a}\right) \,. \] Thus, in place of, e.g., $F_Y(y_{\rm obs} \vert \mu_{\alpha/2})$, we plug into the root-finding equation the quantity $F_{T,n-1}((y_{\rm obs}-\mu_{\alpha/2})/(s_{\rm obs}/\sqrt{n}))$; the solution in this case would be \[ \hat{\mu}_U = y_{\rm obs} - \frac{s_{\rm obs}}{\sqrt{n}}F_{T,n-1}^{-1}\left(\frac{\alpha}{2}\right) \,. \] As we cannot work with $F_{T(n-1)}^{-1}\left(\frac{\alpha}{2}\right)$ by band, we will again fall back on numerical methods, and we will show in an example below how to compute intervals using R code.
$\theta = \sigma^2$. We adopt $Y = S^2$, with the observed statistic being $y_{\rm obs} = s_{\rm obs}^2$. Earlier in this chapter, we show that \[ W = \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2 \,, \] where $\chi_{n-1}^2$ is the chi-square distribution for $n-1$ degrees of freedom. Here, we face the same problem that we faced immediately above: we want $F_Y(y)$, but are given $F_{W(n-1)}(w)$. We mitigate this issue in the same manner as above, by employing a random-variable transformation. \[ F_Y(y) = P(Y \leq y) = P\left(W \leq \frac{(n-1)y_{\rm obs}}{\sigma^2}\right) = F_{W,n-1}\left(\frac{(n-1)y_{\rm obs}}{\sigma^2}\right) \] As was the case above when we deal with the $t$ distribution, we end up having inverse cdfs that we cannot evaluate by hand, such as in the expression \[ \widehat{\sigma^2}_U = \frac{(n-1)y_{\rm obs}}{F_{W,n-1}^{-1}(\alpha/2)} \,. \] As has been the case thus far, the evaluation of bounds requires coding, and we show an example of such an evaluation below.

As this point, the reader may say, “all this looks nothing like the confidence interval stuff I learned in introductory statistics.”

And the reader would be correct. The way in which the construction of confidence intervals is described above and in Chapter 1 is different from the way in which it is described in introductory statistics classes and as well as in traditional calculus-based probability and statistical inference classes. In short: we propose a modern approach that is appropriate in the age of computers that yields the same results as traditional approaches.

In introductory statistics classes, equations are generally all that are provided for confidence intervals, such as the one for putting bounds on the normal mean $\mu$ when the variance is known: \[ \bar{x}_{\rm obs} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \,, \] where $z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)$. (We note that historically this equation has also been applied in situations where we have sampled iid normal data but the variance is unknown, if $n \gtrsim 30$. This context generally falls under the textbook heading of “large-sample confidence intervals.” This is just simply wrong! The only reason this is done is because textbook writers have not wanted to include $t$ tables in their books that go beyond $n = 30$.) We would argue that such equations obscure the reality what is actually happening when we construct confidence intervals: “moving” the sampling distribution for a statistic $Y$ back and forth, by changing the value of a parameter $\theta$, until the value of $\theta$ becomes either too low or too high for the observed value $y_{\rm obs}$ to continue to be plausible.

2.13.1 Confidence Interval for the Normal Mean With Variance Known

Here we construct a two-sided confidence interval for the normal mean $\mu$ when the variance $\sigma^2$ is known. We assume that we have collected 25 iid data and that $\sigma^2 = 1$.

set.seed(101)
alpha  <- 0.05
n      <- 25
sigma2 <- 1
X      <- rnorm(n,mean=0,sd=sqrt(sigma2))

f <- function(mu,sigma2,n,y.obs,q)
{
  pnorm(y.obs,mean=mu,sd=sqrt(sigma2/n))-q
}
uniroot(f,interval=c(-100,100),sigma2=sigma2,n=n,y.obs=mean(X),1-alpha/2)$root

## [1] -0.4875524

uniroot(f,interval=c(-100,100),sigma2=sigma2,n=n,y.obs=mean(X),alpha/2)$root

## [1] 0.2964302

In this code above, we search between $\mu = -10,000$ and $\mu = 10,000$ to see where the function $F_Y(y_{\rm obs} \vert \theta_q) - q$ crosses zero, where $q$ is $1 - \alpha/2$ (lower bound) or $\alpha/2$ (upper bound). (Making the range of possible $\mu$ values large has no noticeable impact on computation time and helps ensure that the root will not lie outside the specified interval.) We test the code by sampling $n = 25$ data from a standard normal distribution $\mathcal{N}(0,1)$, and passing the statistic $y_{\rm obs} = \bar{x}_{\rm obs}$ to uniroot(). We find that the interval is $[\hat{\theta}_L,\hat{\theta}_U] = [-0.488,0.296]$, which overlaps the true value. (See Figure 2.11.)

$\label{fig:mukvarci}Sampling distributions for $Y = \bar{X} = \sum_{i=1}^n X_i$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,1)$, and where (left) $\mu=-0.488$ and (right) $\mu=0.296$. We observe $y_{\rm obs} = -0.096$ and we want to construct a 95\% confidence interval. $\mu=-0.488$ is the smallest value of $\mu$ such that $F_Y^{-1}(0.975) = -0.096$, while $\mu=0.296$ is the largest value of $\mu$ such that $F_Y^{-1}(0.025) = -0.096$.$ $\label{fig:mukvarci}Sampling distributions for $Y = \bar{X} = \sum_{i=1}^n X_i$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,1)$, and where (left) $\mu=-0.488$ and (right) $\mu=0.296$. We observe $y_{\rm obs} = -0.096$ and we want to construct a 95\% confidence interval. $\mu=-0.488$ is the smallest value of $\mu$ such that $F_Y^{-1}(0.975) = -0.096$, while $\mu=0.296$ is the largest value of $\mu$ such that $F_Y^{-1}(0.025) = -0.096$.$

Figure 2.11: Sampling distributions for $Y = \bar{X} = \sum_{i=1}^n X_i$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,1)$, and where (left) $\mu=-0.488$ and (right) $\mu=0.296$. We observe $y_{\rm obs} = -0.096$ and we want to construct a 95% confidence interval. $\mu=-0.488$ is the smallest value of $\mu$ such that $F_Y^{-1}(0.975) = -0.096$, while $\mu=0.296$ is the largest value of $\mu$ such that $F_Y^{-1}(0.025) = -0.096$.

We note that to verify the “coverage” of intervals, i.e., the proportion of the intervals that overlap the true value, we can simply wrap the code provided above with a for loop, and save the lower and upper bounds for each generated dataset.

set.seed(101)
num.sim <- 10000
alpha   <- 0.05
n       <- 25
sigma2  <- 1
lower   <- rep(NA,num.sim)
upper   <- rep(NA,num.sim)
f <- function(mu,sigma2,n,y.obs,q)
{
  pnorm(y.obs,mean=mu,sd=sqrt(sigma2/n))-q
}
for ( ii in 1:num.sim ) {
  X         <- rnorm(n,mean=0,sd=sqrt(sigma2))
  lower[ii] <-
    uniroot(f,interval=c(-100,100),sigma2=sigma2,n=n,y.obs=mean(X),1-alpha/2)$root
  upper[ii] <-
    uniroot(f,interval=c(-100,100),sigma2=sigma2,n=n,y.obs=mean(X),alpha/2)$root
}

We can then see how often the true value is within the bounds:

truth    <- 0
in.bound <- (lower <= truth) & (upper >= truth)
cat("The empirical coverage is ",sum(in.bound)/num.sim,"\n")

## The empirical coverage is  0.9518

Here, if lower <= truth, then TRUE is returned, as is the case if upper >= truth. & is the logical and operator, which returns TRUE if the input is TRUE & TRUE and returns FALSE otherwise. R treats TRUE as equivalent to 1 (and FALSE as equivalent to 0), so sum(in.bound) returns the number of final TRUE values, i.e., the number of evaluated intervals that overlap the true value.

Our estimated coverage is 0.9518. We do not expect a value of exactly 0.95 given that we only run a finite number of simulations and thus the coverage will exhibit random variation; however, we can say that this value is “close” to 0.95 and thus almost certainly compatible with 0.95. Once we discuss the binomial distribution in Chapter 3, we will more easily be able to quantify the notion of being “close” to the expected value.

2.13.2 Confidence Interval for the Normal Mean With Variance Unknown

We extend the previous example by assuming the variance is unknown.

set.seed(101)
alpha <- 0.05
n     <- 25
X     <- rnorm(n,mean=0,sd=1)
f <- function(mu,s2,n,y.obs,q)
{
  pt((y.obs-mu)/sqrt(s2/n),n-1)-q
}
uniroot(f,interval=c(-100,100),s2=var(X),n=n,y.obs=mean(X),1-alpha/2)$root

## [1] -0.4483761

uniroot(f,interval=c(-100,100),s2=var(X),n=n,y.obs=mean(X),alpha/2)$root

## [1] 0.2572646

The code above is largely equivalent to that in the previous example, with the only changes being swapping in the call to pt() (and altering the argument in that call) in place of the call to pnorm(), and replacing the known value of $\sigma^2$ (sigma2 in the previous example) with $s_{\rm obs}^2$ (s2). We find that the interval is $[\hat{\theta}_L,\hat{\theta}_U] = [-0.448,0.257]$, which is similar to the interval derived in the last example and which still overlaps the true value. (See Figure 2.12.)

$\label{fig:muuvarci}Sampling distributions for $Y = \bar{X}$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,\sigma^2)$, and where (left) $\mu=-0.448$ and (right) $\mu=0.257$. We observe $y_{\rm obs} = -0.096$ and $s_{\rm obs}^2 = 0.855$ and we want to construct a 95\% confidence interval. $\mu=-0.448$ is the smallest value of $\mu$ such that $F_Y^{-1}(0.975) = -0.096$, while $\mu=0.257$ is the largest value of $\mu$ such that $F_Y^{-1}(0.025) = -0.096$.$ $\label{fig:muuvarci}Sampling distributions for $Y = \bar{X}$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,\sigma^2)$, and where (left) $\mu=-0.448$ and (right) $\mu=0.257$. We observe $y_{\rm obs} = -0.096$ and $s_{\rm obs}^2 = 0.855$ and we want to construct a 95\% confidence interval. $\mu=-0.448$ is the smallest value of $\mu$ such that $F_Y^{-1}(0.975) = -0.096$, while $\mu=0.257$ is the largest value of $\mu$ such that $F_Y^{-1}(0.025) = -0.096$.$

Figure 2.12: Sampling distributions for $Y = \bar{X}$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,\sigma^2)$, and where (left) $\mu=-0.448$ and (right) $\mu=0.257$. We observe $y_{\rm obs} = -0.096$ and $s_{\rm obs}^2 = 0.855$ and we want to construct a 95% confidence interval. $\mu=-0.448$ is the smallest value of $\mu$ such that $F_Y^{-1}(0.975) = -0.096$, while $\mu=0.257$ is the largest value of $\mu$ such that $F_Y^{-1}(0.025) = -0.096$.

2.13.3 Confidence Interval for the Normal Variance

Below, we adapt our confidence-interval code to the problem of estimating an interval for the variance $\sigma^2$.

set.seed(101)
alpha <- 0.05
n     <- 25
X     <- rnorm(n,mean=0,sd=1)
f <- function(sigma2,n,y.obs,q)
{ 
  pchisq(y.obs*(n-1)/sigma2,n-1)-q
}
uniroot(f,interval=c(1.e-6,10000),n=n,y.obs=var(X),1-alpha/2)$root

## [1] 0.4454088

uniroot(f,interval=c(1.e-6,10000),n=n,y.obs=var(X),alpha/2)$root

## [1] 1.413897

The changes made to the code above include swapping in pchisq(), removing the variable mean, removing the variable s2 (as y.obs itself is now $s_{\rm obs}^2$), and adjusting the lower bound on the interval (since $\sigma^2 > 0$). We find that the interval is $[\widehat{\sigma}_L^2,\widehat{\sigma}_U^2] = [0.445,1.414]$, which overlaps the true value. (See Figure 2.13.)

$\label{fig:sigma2ci}Sampling distributions for $Y = S^2$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,\sigma^2)$, and where (left) $\sigma^2=0.445$ and (right) $\sigma^2=1.414$. We observe $y_{\rm obs} = s_{\rm obs}^2 = 0.731$ and we want to construct a 95\% confidence interval. $\sigma^2=0.445$ is the smallest value of $\sigma^2$ such that $F_Y^{-1}(0.975) = 0.731$, while $\mu=1.414$ is the largest value of $\sigma^2$ such that $F_Y^{-1}(0.025) = 0.731$.$ $\label{fig:sigma2ci}Sampling distributions for $Y = S^2$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,\sigma^2)$, and where (left) $\sigma^2=0.445$ and (right) $\sigma^2=1.414$. We observe $y_{\rm obs} = s_{\rm obs}^2 = 0.731$ and we want to construct a 95\% confidence interval. $\sigma^2=0.445$ is the smallest value of $\sigma^2$ such that $F_Y^{-1}(0.975) = 0.731$, while $\mu=1.414$ is the largest value of $\sigma^2$ such that $F_Y^{-1}(0.025) = 0.731$.$

Figure 2.13: Sampling distributions for $Y = S^2$, where $n = 25$ and $X_i \sim \mathcal{N}(\mu,\sigma^2)$, and where (left) $\sigma^2=0.445$ and (right) $\sigma^2=1.414$. We observe $y_{\rm obs} = s_{\rm obs}^2 = 0.731$ and we want to construct a 95% confidence interval. $\sigma^2=0.445$ is the smallest value of $\sigma^2$ such that $F_Y^{-1}(0.975) = 0.731$, while $\mu=1.414$ is the largest value of $\sigma^2$ such that $F_Y^{-1}(0.025) = 0.731$.

2.13.4 Confidence Interval: Using the CLT

Let’s assume that we have a iid sample of $n$ data drawn from the distribution \[ f_X(x) = \theta x^{\theta-1} ~~~~ x \in [0,1] \,. \] This distribution is decidedly not normal. Can we derive a confidence interval for the mean of this distribution, where the mean is \[ E[X] = \int_0^1 \theta x x^{\theta-1} dx = \left. \frac{\theta}{\theta+1} x^{\theta+1} \right|_0^1 = \frac{\theta}{\theta+1} \,? \] If we wanted to try to do this “exactly,” then one workflow would be (a) to attempt to determine the sampling distribution of $Y = \bar{X}$ (utilizing, e.g., moment-generating functions), and then (b) code up a variant of the codes given in the examples above. However, regarding (a): the mgf of the distribution is \[ m_X(t) = E[e^{tX}] = \theta \int_0^1 x^{\theta-1} e^{tx} dx \,. \] This looks suspiciously like a gamma-function integral, except the bounds here are 0 and 1, not 0 and $\infty$…what we actually have (or, technically, what we would have after an appropriate variable substitution) is an incomplete gamma function integral. We cannot easily work with this…and so we turn to the Central Limit Theorem.

If $n \gtrsim 30$, we may write that \[ \bar{X} \stackrel{d}{\rightarrow} Y \sim \mathcal{N}(\mu,s_{\rm obs}^2/n) \,. \] (Recall that the CLT assumes that we know $\sigma^2$, but if we do not and use $S^2$ instead, the result will still hold as $n \rightarrow \infty$.) Thus we would use the confidence interval approach that we describe in the first example above, while swapping in $s_{\rm obs}^2$ for $\sigma^2$.

set.seed(101)
alpha <- 0.05
n     <- 25
theta <- 2   # so the mean is 2/3
X     <- rbeta(n,theta,1)  # theta x^(theta-1) is a beta distribution

f <- function(mu,y.obs,s2,n,q)
{
  pnorm(y.obs,mean=mu,sd=sqrt(s2/n))-q
}
uniroot(f,interval=c(-100,100),y.obs=mean(X),s2=var(X),n=n,1-alpha/2)$root

## [1] 0.5888778

uniroot(f,interval=c(-100,100),y.obs=mean(X),s2=var(X),n=n,alpha/2)$root

## [1] 0.7532408

The evaluated confidence interval is $[0.589,0.753]$ (which overlaps the true value $\mu = 2/3$). Note, however, that because an approximate sampling distribution is being used, the coverage can (and will) deviate from expectation (e.g., 0.95). Let’s see how much the coverage deviates in this one case via simulation:

set.seed(101)
alpha   <- 0.05
num.sim <- 10000
n       <- 25
theta   <- 2
lower   <- rep(NA,num.sim)
upper   <- rep(NA,num.sim)
f <- function(mu,y.obs,s2,n,q)
{
  pnorm(y.obs,mean=mu,sd=sqrt(s2/n))-q
}
for ( ii in 1:num.sim ) {
  X <- rbeta(n,theta,1)  
  lower[ii] <- 
    uniroot(f,interval=c(-100,100),y.obs=mean(X),s2=var(X),n=n,1-alpha/2)$root
  upper[ii] <- 
    uniroot(f,interval=c(-100,100),y.obs=mean(X),s2=var(X),n=n,alpha/2)$root
}

truth    <- 2/3
in.bound <- (lower <= truth) & (upper >= truth)
cat("The estimated coverage is ",sum(in.bound)/num.sim,"\n")

## The estimated coverage is  0.936

The estimated coverage is 0.936; it is not quite 0.95. Thus it appears that are evaluated confidence intervals do not overlap the true value as often as we would expect. Note that we would expect that as $n \rightarrow 0$, the approximation will become progressively worse, and that as $n \rightarrow \infty$, the coverage should asymptotically approach $1 - \alpha$.

2.13.5 Historical Digression: the Pivotal Method

The pivotal method is an often-used analytical method for constructing confidence intervals wherein one “reformats” the problem to solve in such a way that one can ultimately utilize statistical tables to determine interval bounds. In the age of computers, the pivotal method has become unnecessary; the numerical root-finding algorithm presented in this chapter constructs the same intervals (and does so even in situations where the pivotal method cannot be used).

The steps of the pivotal method are as follows:

Determine a pivotal quantity $Y$, a statistic that is a function of both the observed data and the parameter $\theta$, and that has a sampling distribution that does not depend on $\theta$. (Note that, as hinted at above, there is no guarantee that a pivotal quantity exists in any given situation. But there may also be situations where we can define multiple pivotal quantities; if so, it does not matter which we adopt, as they all will help generate the same interval.)

Determine constants $a$ and $b$ such that $P(a \leq Y \leq b) = 1-\alpha$. (Here, we are assuming that we are constructing a two-sided interval.)

Rearrange the terms within the probability statement such that $\theta$ is alone in the middle: the quantities on either end are the interval bounds.

Let’s demonstrate how this plays out in the situation where we sample $n$ iid data from a normal distribution with known variance, and wish to construct an interval for $\mu$. As we know, in this situation, \[ \bar{X} \sim \mathcal{N}\left(\mu,\frac{\sigma^2}{n}\right) \,. \] $\bar{X}$ is not a pivotal quantity: its sampling distribution depends on $\mu$. However… \[ Y = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) \] is a pivotal quantity, as it depends on both $\bar{X}$ and $\mu$ and has a sampling distribution that does not depend on $\mu$. Thus step 1 is complete.

As for step 2: \[\begin{align*} P(a \leq Y \leq b) = P(Y \leq b) - P(Y \leq a) &= 1-\alpha \\ \Phi(b) - \Phi(a) &= 1-\alpha \,. \end{align*}\] OK…we appear stuck here, except that we can fall back upon the fact that the standard normal distribution is symmetric around zero, and thus we can say that $a = -b$. That, and the fact that symmetry allows us to write that $\Phi(-b) = 1 - \Phi(b)$, allows us to continue: \[\begin{align*} \Phi(b) - \Phi(a) &= 1-\alpha \\ \Phi(b) - (1 - \Phi(b)) &= 1-\alpha \\ 2\Phi(b) - 1 &= 1-\alpha \\ \Phi(b) &= \frac{2-\alpha}{2} = 1 - \frac{\alpha}{2} \\ b &= \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) = z_{1-\alpha/2} \\ \end{align*}\] Historically, at this point, one would make use of a $z$ table to determine that, e.g., $b = z_{0.975} = 1.96$.

So now we move on to step 3: \[\begin{align*} P\left(-z_{1-\alpha/2} \leq Y \leq z_{1-\alpha/2}\right) = P\left(-z_{1-\alpha/2} \leq \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \leq z_{1-\alpha/2}\right) &= 1-\alpha \\ \Rightarrow P\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) &= 1-\alpha \,. \end{align*}\] Thus our confidence interval is \[ \bar{x}_{\rm obs} \pm z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \,. \]

2.14 Hypothesis Testing: Testing for Normality

Recall: a hypothesis test is a framework to make an inference about the value of a population parameter $\theta$. The null hypothesis $H_o$ is that $\theta = \theta_o$, while possible alternatives $H_a$ are $\theta \neq \theta_o$ (two-tail test), $\theta > \theta_o$ (upper-tail test), and $\theta < \theta_o$ (lower-tail test). For, e.g., a one-tail test, we reject the null hypothesis if the observed test statistic $y_{\rm obs}$ falls outside the bound given by $y_{RR}$, which is a solution to the equation \[ F_Y(y_{RR} \vert \theta_o) - q = 0 \,, \] where $F_Y(\cdot)$ is the cumulative distribution function for the statistic $Y$ and $q$ is an appropriate quantile value that is determined using the hypothesis test reference table introduced in section 17 of Chapter 1. Note that the hypothesis test framework only allows us to make a decision about a null hypothesis; nothing is proven.

In a conventional data analysis workflow, we might receive a sample of $n$ iid data without any knowledge about the distribution from which the individual data are drawn, and we might want to test a hypothesis about, e.g., the population mean. There are potentially many hypothesis tests from which to choose, but the ones that are generally the most powerful (i.e., the ones that do the best at discriminating between a null hypothesis and a given alternative hypothesis) are the ones that assume a functional form for the distribution of the individual data. Below, we provide details about tests of population means and variances that are built upon the assumption that the individual data are iid draws from a normal distribution…but before we talk about them, we will introduce hypothesis tests that allow us to decide whether our data are plausibly normally distributed in the first place.

To be clear about the analysis workflow…

If we perform a hypothesis test for which the null hypothesis is that the individual data are normally distributed, and we fail to reject this null hypothesis, we can feel free to use the hypothesis tests in the next two sections, regardless of the sample size $n$.
If we reject the null hypothesis but the sample size is sufficiently large ($n \gtrsim 30$), and we wish to test hypotheses about the population mean, we can fall back on the central limit theorem and construct a variant of the tests presented in the next section.
If we reject the null hypothesis that the data are normally distributed, and the sample size is small or if we wish to test hypotheses about population variances (or both), we should tread carefully if we decide to use the hypothesis testing frameworks presented in the next two sections, as the results we get might be very inaccurate(!).

However, regarding the third point above, we have to keep in mind that as $n$ gets larger and larger, tests of normality become more and more prone to rejecting the null even when the deviation of the true distribution from a normal is small, meaning that the hypothesis tests detailed in the next two sections might actually yield “useful” results! Again, we would need to tread carefully. Alternative approaches include implementing so-called nonparametric tests (which we do not cover in this book), or trying to determine another distribution with which the data are consistent and building a test utilizing its properties, etc.

In the end: if in doubt, simulate. If we can make an assumption about the distribution from which our $n$ iid data are sampled, we can always carry out hypothesis tests numerically. The reader should always keep this in mind: in the age of computers, one rarely if ever needs to “force” an assumption of normality into the solution of a hypothesis testing problem.

Let’s assume that we have collected $n$ iid data from some unknown (and unassumed) distribution $P$. Could $P$ be the normal distribution? There are several varieties of tests whose null hypotheses are “the data are sampled from the [insert name here] distribution.” The most well-known and often-used is the Kolmogorov-Smirnov test, or KS test. (We note that it is not necessarily the most powerful test among those that assess the consistency of data with named distributions, but it is easy to implement and simple to understand. The interested reader should investigate alternatives like, e.g., the Anderson-Darling test and Cramer-von Mises test, and for the specific case of testing for normality, the Shapiro-Wilk test, which we discuss below in an example.)

The empirical cumulative distribution function (or ecdf) for a dataset is defined as \[ F_{X,n}(x) = \frac{1}{n} \left(\mbox{number of observed data} \, \leq x\right) \,. \] We see immediately that $F_{X,n}(x)$ behaves like a cdf, in the sense that $F_{X,n}(-\infty) = 0$ and $F_{X,n}(\infty) = 1$, etc. It is a monotonically increasing step function, with steps of size $1/n$ taken at each datum’s coordinate $x_i$. If we assume a particular distribution as the null distribution, with cdf $F_X(x)$, then the KS test statistic is \[ D_n = \mbox{sup} \vert F_{X,n}(x) - F_X(x) \vert \,. \] We have seen the abbrevation “inf” before, signaling that we are taking the infimum, or smallest value, of a set; here, “sup” stands for supremum, the largest value of a set. So for the KS test, the test statistic is simply the largest distance observed between the empirical and null cdfs; if the null is wrong, we expect this distance to be large. Under the null, $K = \sqrt{n}D_n$ is assumed to be, at least approximately, sampled from the Kolmogorov distribution, whose cdf is \[ P(K \leq k) = \frac{\sqrt{2\pi}}{k} \sum_{x=1}^\infty \exp\left(-\frac{(2x-1)^2\pi^2}{8k^2}\right) \,. \] Having sampled $k_{obs} = \sqrt{n}D_n$, we can use this cdf to establish whether our not the test statistic falls into the rejection region. If it does, we reject the null hypothesis that the individual data are normally distributed.

2.14.1 The Kolmogorov-Smirnov Test

Let’s suppose we are given a dataset that has the empirical distribution given in Figure 2.14.

$\label{fig:ksex} Histogram of a dataset with sample size $n = 30$. Is it plausible that these data are normally distributed?$

Figure 2.14: Histogram of a dataset with sample size $n = 30$. Is it plausible that these data are normally distributed?

The sample statistics for these data are $\bar{X} = 100.6$ and $S = 29.8$. In Figure 2.15, we plot the empirical cdf for these data, overlaying the cdf for a $\mathcal{N}(100.6,29.8^2)$ distribution.

$\label{fig:ksecdf} Empirical cumulative distribution function of our dataset with the cdf for a $\mathcal{N}(100.6,29.8^2)$ distribution overlaid as the red dashed line.$

Figure 2.15: Empirical cumulative distribution function of our dataset with the cdf for a $\mathcal{N}(100.6,29.8^2)$ distribution overlaid as the red dashed line.

The largest difference between the empirical cdf and the overlaid normal cdf in Figure 2.15 occurs at $x \approx 93$. Is this difference large enough for us to decide the data are not normally distributed with the inferred parameters $\mu = 100.6$ and $\sigma = 29.8$?

ks.test(x,"pnorm",100.6,29.8)

## 
##  Exact one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.13858, p-value = 0.5649
## alternative hypothesis: two-sided

By default, ks.test() performs a two-sided test; type ?ks.test in the R console to see how to specify alternatives. According to this test, the $p$-value is 0.56. We formally introduce $p$-values below, in the next section; it suffices to say here that if $p < \alpha$, where $\alpha$ is the user-specified Type I error (typically 0.05), we reject the null, so here we fail to reject the null hypothesis that the data are normally distributed. In truth, the simulated data are sampled from a gamma distribution (see Chapter 4); with a larger sample size, the test becomes more better able to differentiate between normally distributed and gamma-distributed data, so eventually we would reject the null hypothesis.

2.14.2 The Shapiro-Wilk Test

The Shapiro-Wilk test has the null hypothesis that our $n$ iid data are sampled from a normal distribution. The test statistic $W$, which utilizes order statistics (a concept we introduce in Chapter 3), is complicated and will not be reproduced here. It suffices for now for us to make three points.

The Shapiro-Wilk test is better able to reject the null hypothesis that our data are normally distributed than the KS test. (This makes sense because the SW test is built to be more “specific”$-$it is just used to test for normality). In other words, given the same data, the SW test will almost always output a smaller $p$-value than the KS test.

The sampling distribution for $W$ is non-analytic and thus the rejection region is estimated numerically, and thus the Shapiro-Wilk test cannot be used in R with samples of size $>$ 5000. If our sample size is larger, we should randomly sub-sample the data before performing the SW test.

The Shapiro-Wilk test will often reject the null hypothesis of normality even when the data appear by eye to be very nearly normally distributed. This goes back to one of the points made above: we still might be able to glean “useful” results when performing normal-based hypothesis tests using technically non-normal data. Tread carefully, and simulate if necessary.

Below, we demonstrate the use of the Shapiro-Wilk test with the gamma-distributed data of the first example above.

shapiro.test(x)

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.9508, p-value = 0.1776

We see that while the KS test returned a $p$-value of 0.56, the Shapiro-Wilk test returns the value 0.18. This value is still larger than $\alpha = 0.05$, so we still fail to reject the null.

2.15 Hypothesis Testing: Population Mean

Let’s begin by assuming we are given $n$ iid data $\{X_1,\ldots,X_n\}$ sampled from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Because $Y = \bar{X}$ is the MLE for the normal mean, we identify it as a plausible statistic for testing hypotheses about $\mu$. (We will justify this choice in in Chapters 3 and 4 when we discuss the Neyman-Pearson Lemma and the Likelihood Ratio Test.) Refer back to the hypothesis test reference table in section 17 of Chapter 1. We observe that $E[Y] = E[\bar{X}] = \mu$, so $E[Y]$ increases with $\mu$…and thus the rejection regions for a two-tail test are \[ y_{\rm obs} < y_{\rm RR,lo} ~~~ \mbox{and} ~~~ y_{\rm obs} > y_{\rm RR,hi} \,, \] where $y_{\rm RR,lo}$ and $y_{\rm RR,hi}$ are solutions to the equations \[ F_Y(y \vert \mu_o,\sigma^2) - q = 0 \,, \] with $q = \alpha/2$ and $q = 1-\alpha/2$, respectively. Can we solve for rejection region boundaries analytically? The answer is no: the pdf for our test statistic is $\mathcal{N}(\mu,\sigma^2/n)$, and its cdf contains the error function; this is not a cdf we can work with by hand. But finding a numerical solution is straightforward: for instance, \[ y_{\rm RR,lo} = F_Y^{-1}\left(\frac{\alpha}{2} \vert \mu_o,\sigma^2\right) \] can be numerically evaluated using the R function qnorm().

What happens if the variance is unknown? In this case, we would borrow (and slightly amend) a result from the confidence interval section above, namely that \[ F_Y(y) = F_{T(n-1)}\left( \frac{y-\mu_o}{s_{\rm obs}/\sqrt{n}} \right) \,, \] which means that for instance, \[ y_{\rm RR,lo} = \mu_o + \frac{s_{\rm obs}}{\sqrt{n}} F_{T(n-1)}^{-1}\left(\frac{\alpha}{2}\right) \,, \] which can be numerically evaluated using the R function qt(). (See Figure 2.16.)

$\label{fig:normrr}Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) population mean tests (with $\mu_o = 0$ and $s_{\rm obs}^2 = 1$), assuming $\alpha = 0.05$ and $n-1 = 5$ degrees of freedom.$ $\label{fig:normrr}Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) population mean tests (with $\mu_o = 0$ and $s_{\rm obs}^2 = 1$), assuming $\alpha = 0.05$ and $n-1 = 5$ degrees of freedom.$ $\label{fig:normrr}Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) population mean tests (with $\mu_o = 0$ and $s_{\rm obs}^2 = 1$), assuming $\alpha = 0.05$ and $n-1 = 5$ degrees of freedom.$

Figure 2.16: Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) population mean tests (with $\mu_o = 0$ and $s_{\rm obs}^2 = 1$), assuming $\alpha = 0.05$ and $n-1 = 5$ degrees of freedom.

There are two new hypothesis-test-related concepts that we introduce here.

The first is the concept of the p-value. This is the probability of observing a hypothesis test statistic value $y_{\rm obs}$ or a more “extreme” value (i.e., a value even further from the null), given that the null hypothesis is true. We reject the null when $p < \alpha$, the user-specified Type I error. See below for more information about $p$-values. Here, it suffices to say that when we can compute a $p$-value, we should: it is more informative than simply saying that the test statistic does, or does not, lie in the rejection region.

The second new concept that we introduce here is hypothesis test power. In words, it is the probability of rejecting the null hypothesis given an arbitrary parameter value $\theta$: \[ \mbox{power}(\theta) = P(\mbox{reject}~H_o \vert \theta) \,. \] The power is a function of $\theta$, and its value is $1-\beta(\theta,\alpha)$, where $\beta(\theta,\alpha)$ is the Type II error. We note that if we set $\theta = \theta_o$, the null hypothesis value, then the test power is by definition $\alpha$ (the probability of rejecting the null if the null is correct).

Below we provide a second hypothesis test reference table, one that contains details on the computations that one needs to carry out to compute $p$-values and test power. Like the first table, this is not a table to memorize!

Type	$E[Y]$ Increases with $\theta$?	$p$-Value	Test Power
two-tail	yes	2 $\times$ min[$F_Y(y_{\rm obs} \vert \theta_o)$,	$F_Y(y_{\rm RR,lo} \vert \theta)$ +
		$1-F_Y(y_{\rm obs} \vert \theta_o)$]	$1 - F_Y(y_{\rm RR,hi} \vert \theta)$
	no	same as above	same as above

lower-tail	yes	$F_Y(y_{\rm obs} \vert \theta_o)$	$F_Y(y_{\rm RR} \vert \theta)$
	no	$1-F_Y(y_{\rm obs} \vert \theta_o)$	$1-F_Y(y_{\rm RR} \vert \theta)$
upper-tail	yes	$1-F_Y(y_{\rm obs} \vert \theta_o)$	$1-F_Y(y_{\rm RR} \vert \theta)$
	no	$F_Y(y_{\rm obs} \vert \theta_o)$	$F_Y(y_{\rm RR} \vert \theta)$

2.15.1 Testing a Hypothesis About the Normal Mean with Variance Known

Let’s assume that we have sampled $n = 25$ iid data from a normal distribution whose variance is $\sigma^2 = 1$. We wish to test the null hypothesis $H_o : \mu = \mu_o = 2$ versus the alternative $H_a : \mu \neq \mu_o$. What are the rejection regions for this test? What is the $p$-value if we observe $\bar{x}_{\rm obs} = 1.404$? Last, what is the power of this test for $\mu = 1.8$? We assume the level of the test is $\alpha = 0.05$.

To answer each of these questions, we will refer to the hypothesis test reference tables given in Chapter 1 and above.

First we determine the rejection region boundaries: \[ y_{\rm RR,lo} = F_Y^{-1}\left(\frac{\alpha}{2} \vert \mu_o,\sigma^2\right) ~~~ \mbox{and} ~~~ y_{\rm RR,hi} = F_Y^{-1}\left(1-\frac{\alpha}{2} \vert \mu_o,\sigma^2\right) \,, \] where $Y = \bar{X}$. The R function calls are

n        <- 25
alpha    <- 0.05
mu.o     <- 2
sigma2   <- 1
(y.rr.lo <- qnorm(alpha/2,mean=mu.o,sd=sqrt(sigma2/n)))

## [1] 1.608007

(y.rr.hi <- qnorm(1-alpha/2,mean=mu.o,sd=sqrt(sigma2/n)))

## [1] 2.391993

We reject the null hypothesis if $y_{\rm obs}$ is less than 1.608 or greater than 2.392. Note that these boundaries each lie the same distance from $\mu_o = 2$, due to the symmetry of the normal distribution from which $\bar{X}$ is sampled. Also note that the placement of these boundaries does not depend on the observed statistic value $y_{\rm obs} = 1.404$). See Figure 2.17, where the rejection regions are displayed in red.

Next, we compute the $p$-value, noting that we already know that is has to be less than 0.05, since the observed value of 1.404 lies in the rejection region. Using the table above, we infer the relevant code:

n      <- 25
y.obs  <- 1.404
mu.o   <- 2
sigma2 <- 1
2*min(c(pnorm(y.obs,mean=mu.o,sd=sqrt(sigma2/n)),
        1-pnorm(y.obs,mean=mu.o,sd=sqrt(sigma2/n))))

## [1] 0.002882484

The $p$-value is 0.0029. This value is to be interpreted as the probability that we would observe a value as far or farther from 2 as 1.404 (and 2.596, by symmetry) is 0.29 percent, if the null is correct. That is sufficiently small that we conclude that the null hypothesis is incorrect.

Last, we look at the test power assuming $\mu = 1.8$. Again, we use the reference table above and infer the relevant code:

n      <- 25
alpha  <- 0.05
mu.o   <- 2
mu     <- 1.8
sigma2 <- 1
pnorm(y.rr.lo,mean=mu,sd=sqrt(sigma2/n)) + 
  (1-pnorm(y.rr.hi,mean=mu,sd=sqrt(sigma2/n)))

## [1] 0.170075

The test power is 0.170: if $\mu$ is equal to 1.8, then 17.0 percent of the time we would sample a value of our hypothesis test statistic that lies in the rejection region. The farther $\mu$ is from $\mu_o$, the greater the power gets.

$\label{fig:normex1}The sampling distribution $f_Y(y)$ (blue curve), rejection regions (red polygons), and observed statistic value $y_{m obs}$ (vertical green line) for $Y = \bar{X}$ given $n = 25$ iid data that are assumed to be sampled from a $\mathcal{N}(2,1)$ distribution under the null hypothesis. We perform a two-tailed test, with $\alpha = 0.05$, and the variance $\sigma^2 = 1$ is assumed known. The value we observe is in the rejection region, thus we reject the null hypothesis and conclude that $\mu \neq 2$.$

Figure 2.17: The sampling distribution $f_Y(y)$ (blue curve), rejection regions (red polygons), and observed statistic value $y_{m obs}$ (vertical green line) for $Y = \bar{X}$ given $n = 25$ iid data that are assumed to be sampled from a $\mathcal{N}(2,1)$ distribution under the null hypothesis. We perform a two-tailed test, with $\alpha = 0.05$, and the variance $\sigma^2 = 1$ is assumed known. The value we observe is in the rejection region, thus we reject the null hypothesis and conclude that $\mu \neq 2$.

2.15.2 Testing a Hypothesis About the Normal Mean with Variance Unknown

Let’s assume that we have sampled $n = 15$ iid data from a normal distribution whose variance is unknown. We wish to test the null hypothesis $H_o : \mu = \mu_o = 4$ versus the alternative $H_a : \mu > \mu_o$. What is the rejection region for this test? What is the $p$-value if we observe $\bar{x}_{\rm obs} = 4.5$? Last, what is the power of this test for $\mu = 4.5$? We assume the level of the test is $\alpha = 0.05$ and that the sample variance is $s_{\rm obs}^2 = 3$.

To answer each of these questions, we will refer to the hypothesis test reference tables given in Chapter 1 and above.

First we determine the rejection region boundary: \[ y_{\rm RR} = F_Y^{-1}\left(1-\alpha \vert \mu_o,s_{\rm obs}^2\right) \,, \] where $Y = (\bar{X}-\mu_o)/\sqrt{s_{\rm obs}^2/n}$ is a $t$-distributed random variable for $n-1$ degrees of freedom. The R function call is

alpha <- 0.05
n     <- 15
(y.rr  <- qt(1-alpha,n-1))

## [1] 1.76131

We reject the null hypothesis if $y_{\rm obs}$ is greater then 1.761. See Figure 2.18, in which the rejection region is displayed in red.

Next, we compute the $p$-value. Using the table above, we infer the relevant code:

n      <- 15
mu.o   <- 4
s2     <- 3
y.obs  <- (4.5-mu.o)/sqrt(s2/n)
1-pt(y.obs,n-1)

## [1] 0.1411855

The $p$-value is 0.141, which is $> \alpha$. The probability that we would observe a value as far or farther from 4 as 4.5 is 14.1 percent, if the null is correct. We fail to reject the null hypothesis; we cannot conclude that we have sufficient data to reject the idea that $\mu = 4$.

Last, we look at the test power assuming $\mu = 4.5$. Again, we use the reference table above…but there is an issue: a $t$-distribution pdf is not a function of $\mu$! Thus we have to add some extra computational steps to complete the power calculation.

We “map” $y_{\rm RR}$ to $\bar{x}_{\rm RR}$: \[ y_{\rm RR} = \frac{\bar{x}_{\rm RR} - \mu_o}{s_{\rm obs}/\sqrt{n}} ~~~ \Rightarrow ~~~ \bar{x}_{\rm RR} = \mu_o + y_{\rm RR}\frac{s_{\rm obs}}{\sqrt{n}} \,. \]

We determine $y_{\rm RR}'$, the rejection-region boundary given the alternative value $\mu$: \[ y_{\rm RR}' = \frac{\bar{x}_{\rm RR} - \mu}{s_{\rm obs}/\sqrt{n}} \,. \] We plug this value of the rejection region boundary into the power calculation. Here’s the relevant code:

mu         <- 4.5
x.bar.rr   <- mu.o + y.rr*sqrt(s2/n)
y.rr.prime <- (x.bar.rr - mu)/sqrt(s2/n)
1-pt(y.rr.prime,n-1)

## [1] 0.2652206

The test power is 0.265: if $\mu$ is equal to 4.5, then 26.5 percent of the time we would sample a value of our hypothesis test statistic that lies in the rejection region. The farther $\mu$ is from $\mu_o$ (in the positive direction), the higher this percentage gets. See Figure 2.19.

$\label{fig:varex}The sampling distribution $f_Y(y)$ (blue curve), rejection region (red polygon), and observed statistic value $y_{m obs}$ (vertical green line) for $Y = \bar{X}$ given $n = 15$ iid data that are assumed to be sampled from a normal distribution with mean $\mu_o = 4$ and variance unknown under the null hypothesis. We perform an upper-tailed test, with $\alpha = 0.05$. The value we observe is *not* in the rejection region, thus we fail to reject the null hypothesis and conclude that $\mu = 4$ is a plausible value.$

Figure 2.18: The sampling distribution $f_Y(y)$ (blue curve), rejection region (red polygon), and observed statistic value $y_{m obs}$ (vertical green line) for $Y = \bar{X}$ given $n = 15$ iid data that are assumed to be sampled from a normal distribution with mean $\mu_o = 4$ and variance unknown under the null hypothesis. We perform an upper-tailed test, with $\alpha = 0.05$. The value we observe is not in the rejection region, thus we fail to reject the null hypothesis and conclude that $\mu = 4$ is a plausible value.

$\label{fig:varpowcurve}The power curve for the upper-tail test $H_o : \mu = \mu_o = 4$ versus $H_a : \mu > \mu_o$. The red dot at $\mu = 4$ indicates the test power under the null: the power is $\alpha$, as it should be given the definition of test power. We see that $n = 15$ is a sufficient amount of data to clearly disambiguate between $\mu_o = 4$ and, e.g., $\mu \gtrsim 5$. We note that if the alternative hypothesis is $\mu < \mu_o$, the curve would be reversed$-$it would rise towards the left$-$and if the alternative hypothesis is $\mu \neq \mu_o$, the power curve would have a U shape, with the minimum power being $\alpha$ at $\mu = \mu_o$.$

Figure 2.19: The power curve for the upper-tail test $H_o : \mu = \mu_o = 4$ versus $H_a : \mu > \mu_o$. The red dot at $\mu = 4$ indicates the test power under the null: the power is $\alpha$, as it should be given the definition of test power. We see that $n = 15$ is a sufficient amount of data to clearly disambiguate between $\mu_o = 4$ and, e.g., $\mu \gtrsim 5$. We note that if the alternative hypothesis is $\mu < \mu_o$, the curve would be reversed$-$it would rise towards the left$-$and if the alternative hypothesis is $\mu \neq \mu_o$, the power curve would have a U shape, with the minimum power being $\alpha$ at $\mu = \mu_o$.

2.15.3 Hypothesis Testing: Using the CLT

Let’s go back to the example in the confidence interval section above, in which we assumed that we sampled $n$ iid data from the decidedly not normal distribution \[ f_X(x) = \theta x^{\theta-1} ~~ x \in [0,1] \,, \] with $\theta > 0$, and for which $E[X] = \mu = \theta/(\theta+1)$. We found that because we cannot express the sampling distribution of $Y = \bar{X}$ analytically, our only path to determining a confidence interval was by making use of the Central Limit Theorem.

Confidence interval construction and the performance of hypothesis tests are “two sides of the same coin”: in both we are solving for roots of equations, and the only difference is in terms of what is fixed (the observed statistic for confidence intervals and the null hypothesis value $\theta_o$ for hypothesis tests) and what we solve for (the parameter value $\theta$ for confidence intervals and the rejection-region boundaries for hypothesis tests). Thus what “works” for confidence intervals will work for hypothesis tests, meaning that we can utilize the CLT to test the null hypothesis $H_o : \theta = \theta_o$. (So long as $n \gtrsim 30$!)

To find the rejection region(s), $p$-value, and power for tests of the mean when the CLT is involved, we utilize the same codes as we utilize for the “variance known” case (i.e., the ones based on pnorm() and qnorm()), except that instead of plugging in $\sigma^2$ (i.e., sigma2), we would plug in the sample variance $s_{\rm obs}^2$ (i.e., s2). For instance:

n        <- 25
alpha    <- 0.05
theta.o  <- 2
mu.o     <- theta.o/(theta.o+1)
s2       <- 0.0440
(y.rr.lo <- qnorm(  alpha/2,mean=mu.o,sd=sqrt(s2/n)))

## [1] 0.5844416

(y.rr.hi <- qnorm(1-alpha/2,mean=mu.o,sd=sqrt(s2/n)))

## [1] 0.7488918

The rejection region boundaries are thus 0.5844 and 0.7489.

We note that, as is the case for confidence interval coverage, our result is only approximate, i.e., the Type I error rate is not exactly $\alpha$, because the distribution of $\bar{X}$ is not exactly normal. However, as we do in the confidence interval case, we can perform simulations to assess just how far off our actual Type I error rate $\alpha$ is for any given analysis setting.

set.seed(101)
num.sim  <- 10000    # repeat the data-generating process 10,000 times
n        <- 25
alpha    <- 0.05
theta.o  <- 2        # the mean is 3/(3+1) = 3/4
mu.o     <- theta.o/(theta.o+1)
typeIerr <- 0
for ( ii in 1:num.sim ) {
  X       <- rbeta(n,theta.o,1)
  y.obs   <- mean(X)
  s2      <- var(X)
  y.rr.lo <- qnorm(  alpha/2,mean=mu.o,sd=sqrt(s2/n))
  y.rr.hi <- qnorm(1-alpha/2,mean=mu.o,sd=sqrt(s2/n))
  if ( y.obs <= y.rr.lo || y.obs >= y.rr.hi ) typeIerr = typeIerr+1
}
cat("The observed proportion of Type I errors =",typeIerr/num.sim,"\n")

## The observed proportion of Type I errors = 0.064

We observe an empirical Type I error rate of 0.064. This is sufficiently far from the expected value of 0.05 that we conclude that the distribution of the sample mean is only approximately normal.

2.15.4 Testing With Two Data Samples: the t Test

Let’s assume that we are given two independent samples of normal iid data: \[\begin{align*} \{U_1,\ldots,U_{n_U}\} &\sim \mathcal{N}(\mu_U,\sigma_U^2) \\ \{V_1,\ldots,V_{n_V}\} &\sim \mathcal{N}(\mu_V,\sigma_V^2) \,. \end{align*}\] The sample means are $\bar{U}$ and $\bar{V}$, respectively, and we know that \[ \bar{U} \sim \mathcal{N}\left(\mu_U,\frac{\sigma_U^2}{n_U}\right) ~~\mbox{and}~~ \bar{V} \sim \mathcal{N}\left(\mu_V,\frac{\sigma_V^2}{n_V}\right) \,. \] In the two-sample t test, the null hypothesis $H_o$ is that $\mu_U = \mu_V$. Thus a natural test statistic is $\bar{U}-\bar{V}$, which, as the reader may confirm using the method of moment-generating functions, is normally distributed: \[ \bar{U} - \bar{V} \sim \mathcal{N}\left(\mu_U-\mu_V,\frac{\sigma_U^2}{n_U}+\frac{\sigma_V^2}{n_V}\right) \,. \] However, when neither variance is known, we cannot model the observed data with the normal distribution; instead, in Welch’s t test, we assume the following: \[ T = \frac{\bar{U}-\bar{V}}{S_\Delta} = \frac{\bar{U}-\bar{V}}{\sqrt{\frac{S_U^2}{n_U}+\frac{S_V^2}{n_V}}} \sim t({\nu}) \,, \] i.e., $T$ is sampled from a $t$ distribution with $\nu$ degrees of freedom, where $\nu$ is estimated using the Welch-Satterthwaite equation: \[ \nu \approx \frac{S_\Delta^4}{\frac{S_U^4}{n_U^2(n_U-1)} + \frac{S_V^4}{n_V^2(n_V-1)}} \,. \] The rule-of-thumb for this approximation to hold is that both $n_U$ and $n_V$ are larger than 5. Note that in some applications, one might see $\nu$ rounded off to an integer, but this is not necessary: one can work with the $t$ distribution’s probability density function just fine even if $\nu$ in not an integer! (Any rounding off is done to allow one to use $t$ tables to estimate $p$-values.)

Can we “reformat” this test so as to be consistent with how we perform one-sample tests? The answer is yes. Let the test statistic be $Y = \bar{U}-\bar{V} = S_{\Delta} T$, and let the null hypothesis be $H_o : \mu_U - \mu_V = 0$. Then \[ F_Y(y) = P(Y \leq y) = P(S_{\Delta} T \leq y) = P\left(T \leq \frac{y}{S_{\Delta}}\right) = F_{T(\nu)}\left(\frac{y}{S_{\Delta}}\right) \,, \] and thus the rejection-region boundaries are given by \[ y_q = S_{\Delta} F_{T(\nu)}^{-1}(q) \,, \] where $q = \alpha/2$ or $1-\alpha/2$. In R code, these boundaries are given by

y.rr.lo <- S.Delta * qt(alpha/2,nu)
y.rr.hi <- S.Delta * qt(1-alpha/2,nu)

Additionally, the $p$-values and power are given by

p     <- 2*min(c(pt(y.obs/S.Delta,nu),1-pt(y.obs/S.Delta,nu)))
power <- pt((y.rr.lo-mu.Delta)/S.Delta,nu) + 
         (1-pt((y.rr.hi-mu.Delta)/S.Delta,nu))

where mu.Delta is the specified alternative value for $\mu_U-\mu_V$. We leave it as an exercise to the reader to generalize the expressions and code above for cases in which the null hypothesis value differs from zero.

We note that the above code yields equivalent results to the R function t.test().

As a last point: the reader may ask “what do I do if I have three or more samples and I want to test whether all of them have the same mean, with the alternative being that at least one of the means is different?” In such a situation, one would employ a one-way analysis of variance test, or an ANOVA test; we introduce this test in the final section of this chapter.

2.15.5 More About p-Values

As described above, the $p$-value is the probability of observing a hypothesis test statistic value $y_{\rm obs}$ or a more extreme value, i.e., one further from the null value. In mathematical terms, for a continuous sampling distribution, a $p$-value is \[\begin{align*} p = 2 \times \mbox{min}\left[ \int_{-\infty}^{y_{\rm obs}} f_Y(y \vert \theta_o) dy , \int_{y_{\rm obs}}^\infty f_Y(y \vert \theta_o) dy \right] ~~~& \mbox{(two-tail)} \\ p = \int_{-\infty}^{y_{\rm obs}} f_Y(y \vert \theta_o) dy ~~~& \mbox{(lower-tail)} \\ p = \int_{y_{\rm obs}}^\infty f_Y(y \vert \theta_o) dy ~~~& \mbox{(upper-tail)} \,. \end{align*}\] If the null hypothesis is correct, then $p$ is sampled uniformly between 0 and 1, i.e., 0.83 is just as likely to be observed as 0.14 and as 0.23, etc. This makes sense: the probability that $p < \alpha$ is exactly $\alpha$, the Type I error rate. If we set $\alpha$ to 0.05, then when the null is true, $p$ will be less than $\alpha$ five percent of the time, i.e., we will wrongly make the decision to reject a true null on average one time in every 20 hypothesis tests we perform. (See Figure 2.20.)

$\label{fig:pval}The empirical distribution of $p$-values under the null hypothesis, for a lower-tail test with $n = 20$, $\mu_o = 3$, and $\sigma^2 = 1$. We can see that the $p$-values are uniformly distributed between 0 and 1 (with some random variation due to the fact that our simulation contains a finite sample of data). When the null is true, the probability of sampling a $p$-value less than $\alpha$ is thus exactly $\alpha$.$

Figure 2.20: The empirical distribution of $p$-values under the null hypothesis, for a lower-tail test with $n = 20$, $\mu_o = 3$, and $\sigma^2 = 1$. We can see that the $p$-values are uniformly distributed between 0 and 1 (with some random variation due to the fact that our simulation contains a finite sample of data). When the null is true, the probability of sampling a $p$-value less than $\alpha$ is thus exactly $\alpha$.

Because we are performing a simulation, the number of observations falling into each histogram bin is a random variable, and thus we observe some amount of “noise” in Figure 2.20 (i.e., some amount of variation around the dashed red line)…but nevertheless we can visually infer that the sampling distribution for $p$ is uniform (i.e., flat). (For completeness, the number of counts in each bin are collectively sampled from a multinomial distribution, where the probability of a count being recorded in any given bin being the reciprocal of the number of bins. We discuss the multinomial distribution in Chapter 3.)

Now, what if the null hypothesis is not correct? What will happen to the distribution of $p$-values? The answer depends on the circumstance: if we are performing an lower-tail test (with $E[Y]$ increasing with $\theta$) and the true mean is larger than the hypothesized mean, then the $p$-values will skew towards 1: we are less likely to reject the null when its value is closer to the tail of interest than the true value is. If the true mean is smaller than the hypothesized mean, the $p$-values will skew towards 0. See Figure 2.21 for a demonstration of the latter point.

$\label{fig:pvalskew}The empirical distribution of $p$-values when the stated alternative hypothesis is consistent with the truth. We can see that the $p$-values skew towards 0, meaning that when the alternative hypothesis is true, the probability of sampling a $p$-value less than $\alpha$ can be $\gg \alpha$.$

Figure 2.21: The empirical distribution of $p$-values when the stated alternative hypothesis is consistent with the truth. We can see that the $p$-values skew towards 0, meaning that when the alternative hypothesis is true, the probability of sampling a $p$-value less than $\alpha$ can be $\gg \alpha$.

## The proportion of p-values < 0.05 is  0.29707

In Figure 2.21, 29.71% of the counts are observed as having value less than $\alpha = 0.05$…so the test power is 0.2971. As the true mean gets smaller and smaller than $\mu_o = 3$, the number of counts in the first bins increase…which is equivalent to saying that the test power increases.

We will make three important statements about $p$-values here, ones that are good for all readers to remember.

A p-value is not the probability that the null hypothesis is correct! A hypothesis is not a random variable, and there is thus no probability associated with it. A $p$-value is the probability of observing a more extreme hypothesis test statistic than what we actually observe, conditioned on the null hypothesis being correct.

We cannot select the Type I error rate $\alpha$ after computing the p-value. We specify the Type I error rate $\alpha$ before data are analyzed; specifying it afterwards opens up the analysis procedure to bias: “my $p$-value is 0.09…so I’ll set $\alpha = 0.1$ so that I can reject the null.”

We cannot perform multiple hypothesis tests without correcting our result. Suppose that we perform 100 independent hypothesis tests, each with $\alpha = 0.05$. Even if the null is true for every test, we will end up rejecting the null approximately five times. (Not necessarily exactly five times: the number of rejections we observe is a random variable with mean of five.) That we will reject an increasing number of null hypotheses with an increasing number of tests is inevitable: this is a feature of hypothesis testing, not a bug. When performing multiple tests, we need to apply a correction factor to each observed $p$-value, such as a Bonferroni correction (e.g., if $m$ is the number of conducted tests, reset $\alpha$ to $\alpha/m$) or the less conservative false-discovery rate (FDR) correction. Otherwise, we are engaging in p-hacking and data dredging. We discuss correction factors in more detail in Chapter 5.

2.15.6 Test Power: Sample-Size Computation

We can phrase a typical experimental design exercise as follows. “We will collect data that we assume are iid from a normal distribution with specified mean $\mu$ and specified variance $\sigma^2$.” (Alternatively, it could be a specification of $s_{\rm obs}^2$, motivating the use of $t$ distribution.) “How many data do we need to collect to achieve a test power of 0.8, assuming $\alpha = 0.05$ and assuming that the null hypothesis is $H_o : \mu = \mu_o$, but the true value is $\mu > \mu_o$?”

We will begin answering this question using math, and then we will transition to code.

We refer back to the hypothesis test reference tables and note that for an upper-tail test where $E[Y]$ increases with $\theta$, \[ \mbox{power}(\theta) = 1 - F_Y(y_{\rm RR} \vert \mu_o,\sigma^2/n) \,. \] We substitute in our target power value and solve for $n$: \[\begin{align*} 1 - F_Y(y_{\rm RR} \vert \mu_o,\sigma^2/n) &= 0.8 \\ \Rightarrow ~~~ F_Y(y_{\rm RR} \vert \mu_o,\sigma^2/n) &= 0.2 \\ \Rightarrow ~~~ &\mbox{...?} \,. \end{align*}\] The issue is two-fold, as it turns out: first, $n$ appears to the right of the condition symbol, which means that we will not be solving this by hand but rather numerically via root-finding. But there’s another less-readily apparent issue as well: $y_{\rm RR}$ itself depends on $n$! Can we still pursue root-finding? Yes…we simply will have a more complicated expression.

We transition here to code. We refer back to the rejection-region table and note that for an upper-tail normal mean test, $y_{\rm RR}$ is

qnorm(1-alpha/2,mean=mu.o,sd=sqrt(sigma2/n))   # the null enters here

The analogous code for test power is

1 - pnorm(y.hi,mean=mu,sd=sqrt(sigma2/n))   # the alternative enters here

Combining the two expressions, we get

1 - pnorm(qnorm(1-alpha/2,mean=mu.o,sd=sqrt(sigma2/n)),mean=mu,sd=sqrt(sigma2/n))

To solve the problem at hand, we would subtract 0.8 from this expression, set the result equal to zero, and utilize uniroot() to solve for $n$:

# Assume alpha <- 0.05 ; mu.o <- 8 ; mu <- 9 ; sigma2 <- 6
f <- function(n)
{
  pnorm(qnorm(0.975,mean=8,sd=sqrt(6/n)),mean=9,sd=sqrt(6/n)) - 0.2
}
ceiling(uniroot(f,interval=c(1,1000))$root)

## [1] 48

Note how we use the function ceiling() here, which rounds up the result to the next higher integer. We do this because the sample size should be an integer, and because we want the power to be at least 0.8; if we round our result down (via the floor() function), the power will be less than 0.8. In sample size-determination exercises, always round the final result up!

2.16 Hypothesis Testing: Population Variance

When performing hypothesis tests for the population mean $\mu$, above, we adopt $\bar{X}$ as our hypothesis test statistic because it is the MLE for $\mu$; we do not theoretically justify the choice (by showing, e.g., that its use leads us to defining the most powerful test out of a set of possible alternatives). Here, we follow a similar logic: $S^2$ is an unbiased estimator for $\sigma^2$, so we will adopt $Y = S^2$ as our test statistic when testing hypotheses about $\sigma^2$.

Assume that we are given $n$ iid data $\{X_1,\ldots,X_n\}$ sampled from a normal distribution with (unknown) mean $\mu$ and variance $\sigma^2$. Refer back to the table in the hypothesis testing section of Chapter 1. We observe that $E[Y] = E[S^2] = \sigma^2$, so $E[Y]$ increases with $\sigma^2$. Thus the rejection region boundaries for, e.g., a two-tail test are found by solving \[\begin{align*} F_Y(y_{\rm RR,lo} \vert \sigma_o^2) - \frac{\alpha}{2} &= 0 \\ F_Y(y_{\rm RR,hi} \vert \sigma_o^2) - \left(1 - \frac{\alpha}{2}\right) &= 0 \,. \end{align*}\] As is the case for the population mean, we have to borrow (and amend) a result from the confidence interval section to proceed here, namely that \[ F_Y(y) = F_{W(n-1)}\left(\frac{(n-1)y}{\sigma_o^2}\right) \,, \] where $F_{W(n-1)}$ is the cdf for a chi-square distribution for $n-1$ degrees of freedom. So, for instance, \[ y_{\rm RR,hi} = \frac{\sigma_o^2}{n-1}F_{W(n-1)}^{-1}\left(1-\frac{\alpha}{2}\right) \,, \] or, in code,

y.rr.hi <- sigma2.o*qchisq(1-alpha/2,n-1)/(n-1)

(See Figure 2.22.)

$\label{fig:varrr}Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) tests of population variance, assuming $\alpha = 0.1$ and $n-1 = 3$ degrees of freedom.$ $\label{fig:varrr}Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) tests of population variance, assuming $\alpha = 0.1$ and $n-1 = 3$ degrees of freedom.$ $\label{fig:varrr}Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) tests of population variance, assuming $\alpha = 0.1$ and $n-1 = 3$ degrees of freedom.$

Figure 2.22: Illustration of rejection regions (shaded in red) for two-tail (left), lower-tail (center), and upper-tail (right) tests of population variance, assuming $\alpha = 0.1$ and $n-1 = 3$ degrees of freedom.

2.16.1 Testing a Hypothesis About the Normal Population Variance

Let’s assume that we have sampled $n = 14$ iid data from a normal distribution of unknown mean and variance. We wish to test the null hypothesis $H_o : \sigma^2 = \sigma_o^2 = 4$ versus the alternative $H_a : \sigma^2 > \sigma_o^2$. What are the rejection regions for this test? What is the $p$-value if we observe $s_{\rm obs}^2 = 5.5$? Last, what is the power of this test for $\sigma^2 = 8$? We assume the level of the test is $\alpha = 0.05$.

To answer each of these questions, we utilize the hypothesis test reference tables. For the rejection region, the R function call is

alpha    <- 0.05
n        <- 14
sigma2.o <- 4
(y.rr <- (sigma2.o/(n-1))*qchisq(1-alpha,n-1))

## [1] 6.880625

The rejection region boundary is 6.881: we reject the null hypothesis if $y_{\rm obs} = s_{\rm obs}^2 > 6.881$.

Next, we compute the $p$-value, noting that it has to be greater than 0.05, since the observed value of 5.5 does not lie in the rejection region. The relevant code is

y.obs    <- 5.5
1-pchisq((n-1)*y.obs/sigma2.o,n-1)

## [1] 0.1623236

The $p$-value is 0.162, which is indeed $> \alpha$. There is thus a 16.2 percent chance that we would sample a variance value of 5.5 or greater if the null is correct. This is too high a percentage for us to confidently reject the idea that $\sigma^2 = 4$.

Last, we look at the test power assuming $\sigma^2 = 8$:

sigma2   <- 8
1-pchisq((n-1)*y.rr/sigma2,n-1)

## [1] 0.5956563

The test power is 0.596: if $\sigma^2$ is equal to 8, then 59.6 percent of the time we would sample a value of $y_{\rm obs}$ that lies in the rejection region. This is not a particularly high value; if being able to differentiate between $\sigma_o^2 = 4$ and $\sigma^2 = 8$ is an important research goal, we would want to collect more data!

In the previous section, we show in an example how to determine the sample size that is needed to achieve a particular power. Here we will repeat that exercise, but with a twist: we will visualize the power of the test as a function of $n$. Because we are not solving for $n$ for a particular power value, there is no need to utilize uniroot() here; we simply need to loop over calls to qchisq() and pchisq(). (But note that we don’t actually “loop,” as R’s vectorization capabilities allow us to simply pass vectors of sample-size values into the function calls, with vectors of results being returned.) See Figure 2.23.

alpha    <- 0.05
sigma2.o <- 4
sigma2   <- 8
n        <- 2:100
y.rr     <- (sigma2.o/(n-1))*qchisq(1-alpha,n-1)
power    <- 1-pchisq((n-1)*y.rr/sigma2,n-1)

df       <- data.frame(n,power)
ggplot(data=df,aes(x=n,y=power)) +
  geom_line(col="blue",lwd=1) +
  geom_line(data=data.frame(x=c(0,14,14,14),y=c(0.596,0.596,0,0.596)),
            aes(x=x,y=y),col="red",lty=2,lwd=0.85) +
  geom_point(x=14,y=0.596,size=4,col="red") +
  geom_hline(yintercept=0,lty=2,col="blue") +
  geom_hline(yintercept=1,lty=2,col="blue") +
  labs(x="n",y="Test Power") +
  base_theme

$\label{fig:varpow2}The test power as a function of sample size $n$ assuming $\sigma_o^2 = 4$, $\sigma^2 = 8$, and $\alpha = 0.05$. The red dot indicates the test power for $n = 14$, which as shown in the text is 0.596.$

Figure 2.23: The test power as a function of sample size $n$ assuming $\sigma_o^2 = 4$, $\sigma^2 = 8$, and $\alpha = 0.05$. The red dot indicates the test power for $n = 14$, which as shown in the text is 0.596.

2.16.2 Testing With Two Data Samples: the F Test

As was the case above for the two-sample $t$ test, described above, we assume we are given two independent samples of iid data: \[\begin{align*} \{U_1,\ldots,U_{n_U}\} &\sim \mathcal{N}(\mu_U,\sigma_U^2) \\ \{V_1,\ldots,V_{n_V}\} &\sim \mathcal{N}(\mu_V,\sigma_V^2) \,, \end{align*}\] with sample variances $S_U^2$ and $S_V^2$, respectively. Given what we know about the distribution of sample variance values, we can write that \[ W_U = \frac{(n_U-1)S_U^2}{\sigma_U^2} \sim \chi_{n_U-1}^2 ~~\mbox{and}~~ W_V = \frac{(n_V-1)S_V^2}{\sigma_V^2} \sim \chi_{n_V-1}^2 \,. \] The ratio of $W_U/(n_U-1)$ to $W_V/(n_V-1)$ is \[ F = \left. \left(\frac{\frac{(n_U-1)S_U^2}{\sigma_U^2}}{n_U-1}\right) \right/ \left(\frac{\frac{(n_V-1)S_U^2}{\sigma_V^2}}{n_V-1}\right) = \frac{S_U^2/\sigma_U^2}{S_V^2/\sigma_V^2} \sim F_{n_U-1,n_V-1} \,, \] where $F_{n_U-1,n_V-1}$ is the F distribution for $n_U-1$ numerator, and $n_V-1$ denominator, degrees of freedom. Examining the equation above, we see that $F$ is an appropriate statistic for testing hypotheses about the relationship between the variances of both samples.

Let the test statistic be \[ Y = \frac{s_U^2}{s_V^2} \,. \] (Note that it doesn’t matter which sample is identified as $U$ and which is identified as $V$ so long as care is taken to not accidentally “reverse” the samples when coding the test.) As was the case with the two-sample $t$ test, we observe $Y$ but only know the sampling distribution for $F$, so we need to enact a variable transformation: \[ F_Y(y) = P(Y \leq y) = P\left(\frac{\sigma_U^2}{\sigma_V^2}F \leq y \right) = P\left( F \leq \frac{\sigma_V^2}{\sigma_U^2} y \right) = F_{F(n_U-1,n_V-1)}\left(\frac{\sigma_V^2}{\sigma_U^2} y\right) \,. \] Thus the rejection-region boundaries for a two-tail test are given by

y.rr.lo <- (sigmaU2.o/sigmaV2.o) * qf(alpha/2,n.U-1,n.V-1)
y.rr.hi <- (sigmaU2.o/sigmaV2.o) * qf(1-alpha/2,n.U-1,n.V-1)

with the $p$-values and power given by

p     <- 2*min(c(pf(sigmaV2.o*y.obs/sigmaU2.o,n.U-1,n.V-1),
               1-pf(sigmaV2.o*y.obs/sigmaU2.o,n.U-1,n.V-1)))
power <- pf(sigmaV2*y.rr.lo/sigmaU2,n.U-1,n.V-1) + 
         (1-pf(sigmaV2*y.rr.hi/sigmaU2,n.U-1,n.V-1))

where $\sigma_U^2/\sigma_V^2$ is the specified alternative variance ratio. We leave it as an exercise to the reader to derive appropriate expressions for one-tail tests.

We note that the above code yields equivalent results to the R function var.test().

2.17 Simple Linear Regression

Thus far in this chapter, we have assumed that we have been given a sample of iid data $\{X_1,\ldots,X_n\} \sim \mathcal{N}(\mu,\sigma^2)$, or, sometimes, two samples that are independent of one another. Here we will move beyond this assumption to the case where our data consists of tuples $\{(x_1,Y_1),\ldots,(x_n,Y_n)\}$ and where we want to determine (or learn) the association between the variables $x_i$ and $Y_i$. (Assuming there is one! See Figure 2.24. Also, note the use of the term “association”: a relationship might exist, but it is not necessarily the case that $x$ “causes” $Y$; to determine causation, one would utilize methods of causal inference, which are beyond the scope of this book.) In other words, we want to regress $\mathbf{Y}$ upon $\mathbf{x}$.

$\mathbf{x} = \{x_1,\ldots,x_n\}$: the predictor variable (or independent variable or covariate or feature)
$\mathbf{Y} = \{Y_1,\ldots,Y_n\}$: the response variable (or the dependent variable or target)

$\label{fig:linreg}Illustration of the setting for simple linear regression. The data $Y \vert x$ (blue points) are randomly distributed around the true regression line $y \vert x = 4 + x/2$ (red dashed line), with the error term $\epsilon_i \sim \mathcal{N}(y \vert 0,1)$.$

Figure 2.24: Illustration of the setting for simple linear regression. The data $Y \vert x$ (blue points) are randomly distributed around the true regression line $y \vert x = 4 + x/2$ (red dashed line), with the error term $\epsilon_i \sim \mathcal{N}(y \vert 0,1)$.

If the relationship between $\mathbf{x}$ and $\mathbf{Y}$ is deterministic, then there exists a function $f$ such that $y_i = f(x_i)$ for all $i$. As statisticians, we are more interested in learning non-deterministic associations (so-called stochastic, probabilistic, or statistical relationships) between $\mathbf{x}$ and $\mathbf{Y}$. One standard model for the data-generating process is \[ Y_i \vert x_i = f(x_i) + \epsilon_i \,, \] where $\{\epsilon_1,\ldots,\epsilon_n\}$ are random variables with $E[\epsilon_i] = 0$ and uncorrelated variances $V[\epsilon_i] = \sigma^2$ (i.e., the variances are the same for all $i$…they are homoscedastic). (Note that we are not saying anything about the distribution of $\epsilon_i$ yet; we are just stating assumptions about its expected value and variance.) Now we can say why the predictor variable is represented in lower case while the response variable is in upper case: we consider the predictor variable values to be fixed, while the response variable values are random variables because the $\epsilon_i$’s are random variables.

We now suppose that there is a linear relationship between $\mathbf{x}$ and $\mathbf{Y}$ such that \[\begin{align*} Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\ E[Y_i] &= E[\beta_0 + \beta_1 x_i + \epsilon_i] = \beta_0 + \beta_1 x_i + E[\epsilon_i] = \beta_0 + \beta_1 x_i \\ V[Y_i] &= V[\beta_0 + \beta_1 x_i + \epsilon_i] = V[\epsilon_i] = \sigma^2 \,. \end{align*}\] This is a simple linear regression model, where the word “simple” follows from the fact that there is only one predictor variable. (Note that this is a linear model because it is linear in the parameters; e.g., $\beta_0 + \beta_1x_i^2$ is also a linear model.) $\beta_0$ is dubbed the intercept and $\beta_1$ is the slope. In analyses, we are generally interested in estimating $\beta_1$, as it tells us the average change in $Y$ given a one-unit change in $x$.

Once a simple linear regression model is learned, we can make predictions \[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \] and we can compute residuals \[ e_i = Y_i - \hat{Y}_i \] The goal in the estimation process is to make the residuals as small as possible. Since negative values of $e_i$ are as “bad” as positive values, we estimate $\beta_0$ and $\beta_1$ by minimizing the sum of squared errors or SSE: $\sum_{i=1}^n e_i^2$. (Note that this term is often used interchangeably with the term residual sum of squares, or RSS.) This is the so-called ordinary least squares estimator, or OLS. Our use of the OLS estimator for $\beta_0$ and $\beta_1$ (as opposed to, e.g., $\sum_{i=1}^n \vert e_i \vert$) is motivated by the Gauss-Markov theorem, which states that the unbiased OLS estimator is the best estimator if the $\epsilon_i$’s are uncorrelated, have equal variances, and expected values of zero (hence motivating the assumptions stated above). Note that we have still not said anything about the distribution of the $\epsilon_i$’s: the Gauss-Markov theorem does not mandate that they have a particular distribution!

(As an aside: what happens if, e.g., we have data for which the variances are unequal, or heteroscedastic? We lose the ability to make statements like “the OLS estimator is the best linear unbiased estimator.” We can always learn the simple linear model we define above…a computer cannot stop us from doing so. It just may not be the best model choice…for instance, weighted OLS regression, where the weighting given each datum is related to how uncertain its value is, may provide better estimates.)

To estimate $\beta_0$ and $\beta_1$, we take partial derivatives of \[ \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \] with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$, set the results to zero, and solve for $\hat{\beta}_0$ and $\hat{\beta}_1$. Here, we will quote results, and provide the estimate of $\sigma^2$, while leaving the derivations as exercises to the reader:

Parameter	Estimate	Expected Value	Variance
$\beta_0$	$\bar{Y}-\hat{\beta}_1\bar{x}$	$\beta_0$	$\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\right)$
$\beta_1$	$\frac{S_{xY}}{S_{xx}}$	$\beta_1$	$\frac{\sigma^2}{S_{xx}}$
$\sigma^2$	$\frac{1}{n-2}$ $\sum_{i=1}^n$ $e_i^2$ $=\frac{SSE}{n-2}$	$\sigma^2$	$\frac{2\sigma^4}{n-2}$

The quantities $S_{xx}$ and $S_{xY}$ (and $S_{YY}$) are shorthand for the following: \[\begin{align*} S_{xx} &= \sum_{i=1}^n (x_i-\bar{x})^2 = \left(\sum_{i=1}^n x_i^2\right) - n\bar{x}^2 \\ S_{xY} &= \sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y}) = \left(\sum_{i=1}^n x_iY_i\right) - n\bar{x}\bar{Y} \\ S_{YY} &= \sum_{i=1}^n (Y_i-\bar{Y})^2 = \left(\sum_{i=1}^n Y_i^2\right) - n\bar{Y}^2 \,. \end{align*}\]

In Figure 2.25, we display the estimated regression line for the data shown above in Figure 2.24.

$\label{fig:linregest}Same as the previous figure, with the estimated regression line $\hat{y} = 4.52 + 0.46 x$ overlaid (solid red line).$

Figure 2.25: Same as the previous figure, with the estimated regression line $\hat{y} = 4.52 + 0.46 x$ overlaid (solid red line).

Ultimately, we are interested in determining if there is a statistically significant relationship between $\mathbf{x}$ and $\mathbf{Y}$, a question that we can answer using a hypothesis test: \[\begin{align*} H_o&: \beta_1 = 0 \implies Y_i = \beta_0 + \epsilon_i \\ H_a&: \beta_1 \neq 0 \implies Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \end{align*}\] In order to carry out hypothesis testing, however, we have to now make a distributional assumption regarding the $\epsilon_i$’s. The standard assumption is that \[ \epsilon_i \sim \mathcal{N}(0,\sigma^2) \,, \] which means that \[ Y_i \vert x_i \sim \mathcal{N}(\beta_0+\beta_1x_i,\sigma^2) \] Adding this assumption to the model does not affect estimation; if, for instance, we were to derive the MLE estimators for $\beta_0$ and $\beta_1$, we would find that they are identical to the OLS estimators given above.

Does our distributional assumption for the $\epsilon_i$’s allow us to specify the distribution of, e.g., $\hat{\beta}_1$? The answer is yes. We begin by showing that $\hat{\beta}_1$ can be written as a linear function of the response variable values: \[\begin{align*} \hat{\beta}_1 &= \frac{\left(\sum_{i=1}^n Y_i x_i\right) - n \bar{x}\bar{Y}}{\left(\sum_{i=1}^n x_i^2\right) - n\bar{x}^2} \\ &= \frac{1}{k} \left[ \left(\sum_{i=1}^n Y_i x_i\right) - n \bar{x}\bar{Y} \right] \\ &= \frac{1}{k} \left[ \left(\sum_{i=1}^n Y_i x_i\right) - \sum_{i=1}^n \bar{x} Y_i \right] \\ &= \frac{1}{k} \sum_{i=1}^n (x_i - \bar{x}) Y_i = \sum_{i=1}^n \left( \frac{x_i-\bar{x}}{k}\right) Y_i = \sum_{i=1}^n a_i Y_i \,. \end{align*}\] Because the $Y_i$’s are normally distributed, we know (via the method of moment-generating functions) that $\hat{\beta}_1$ is also normally distributed, with mean \[ E[\hat{\beta}_1] = \sum_{i=1}^n a_i E[Y_i] = \cdots = \beta_1 \] and variance \[ V[\hat{\beta}_1] = \sum_{i=1}^n a_i^2 V[Y_i] = \sum_{i=1}^n a_i^2 \sigma^2 = \cdots = \frac{\sigma^2}{\left(\sum_{i=1}^n x_i^2\right) - n\bar{x}^2} \,. \] We can finally perform the hypothesis test. If $\sigma^2$ is unknown, the standardized slope is assumed to be $t$-distributed for $n-2$ degrees of freedom: \[ \frac{\hat{\beta}_1 - \beta_1}{se({\hat{\beta}_1})} = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{V[{\hat{\beta}_1}]}} \sim t_{n-2} \,. \]

To reiterate:

The assumption that $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$ allows us to test hypotheses about, e.g., $\beta_1$. If this assumption is violated, then OLS regression is still the best linear unbiased estimator.
If, in addition, the assumption that $\epsilon_i$’s are homoscedastic and uncorrelated is violated, the OLS regression will no longer be the best linear unbiased estimator, and we would need to seek alternatives, such as weighted OLS regression.
Last…if, in addition, the assumption of $E[\epsilon_i] = 0$ is violated, then using OLS will be a biased estimator. At this point, we would generally conclude that there is a better, nonlinear representation of the data-generating process that we should use.

2.17.1 Correlation: the Strength of Linear Association

We can measure the strength of a linear association via the metric of correlation. We define the correlation between a pair of random variables $X$ and $Y$ as \[ \rho_{XY} = \frac{E[XY] - E[X]E[Y]}{\sqrt{V[X]V[Y]}}\,, \] with $\vert \rho_{XY} \vert \leq 1$. (We will discuss correlation and the related concept of covariance more in depth in Chapter 6. For now, just treat the equation above as a definition. The main thing to keep in mind is that if $X$ and $Y$ are independent random variables, $\rho_{XY} = 0$, while if $X = Y$ or $X = -Y$, $\rho_{XY} = 1$ and $-1$ respectively.) If instead of a pair of random variables we have a set of tuples, we can still define a correlation, which we can estimate using the Pearson correlation coefficient: \[ R = \frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\left[\sum_{i=1}^n (x_i-\bar{x})^2\right]\left[\sum_{i=1}^n (Y_i-\bar{Y})^2\right]} = \frac{S_{xY}}{\sqrt{S_{xx}S_{YY}}} = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{YY}}} \,. \] The last equality follows from the fact that $\hat{\beta}_1 = S_{xY}/S_{xx}$. It follows from this expression that $R$ and $\hat{\beta}_1$ have the same sign.

Related to the correlation is the coefficient of determination, or $R^2$, a quantity ranging from 0 to 1 that is interpreted as the proportion of the variation in the response variable values explained by the predictor variable values. The closer $R^2$ is to 1, the more strictly linear is the association between $x$ and $Y$.

For completeness, we mention “adjusted $R^2$,” as this is an oft-quoted output from R’s linear model function, lm(). (See below for example usage of lm().) Adjusted $R^2$ attempts to correct for the tendency of $R^2$ to over-optimistically indicate the quality of fit of a linear model. There are several formulae for computing adjusted $R^2$; R uses Wherry’s formula: \[ \mbox{Adj.}~R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1} \,, \] where $p$ is the number of predictor variables (with $p = 1$ for simple linear regression).

2.17.2 The Expected Value of the Sum of Squared Errors (SSE)

Here we show that $SSE/(n-2)$ is an unbiased estimator of $\sigma^2$. This is a non-trivial algebraic exercise, and thus we will leave the calculation of $V[\hat{\sigma^2}] = V[SSE/(n-2)] = 2\sigma^4/(n-2)$ as an exercise to the (masochistic) reader.

We start by writing that \[\begin{align*} SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 &= \sum_{i=1}^n [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 \\ &= \sum_{i=1}^n [Y_i - (\bar{Y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i)]^2 \\ &= \sum_{i=1}^n [(Y_i - \bar{Y}) - \hat{\beta}_1 (x_i-\bar{x})]^2 \\ &= \sum_{i=1}^n [(Y_i - \bar{Y})^2 - 2 \hat{\beta}_1 (x_i-\bar{x})(Y_i - \bar{Y}) + \hat{\beta}_1^2(x_i-\bar{x})^2] \\ &= S_{YY} - 2 \hat{\beta}_1 S_{xY} + \hat{\beta}_1^2 S_{xx} \\ &= S_{YY} - 2 \hat{\beta}_1 (S_{xx} \hat{\beta}_1) + \hat{\beta}_1^2 S_{xx} \\ &= S_{YY} - \hat{\beta}_1^2 S_{xx} \,. \end{align*}\] To find the expected value of this difference, we look first at $S_{YY}$ and then at $\hat{\beta}_1^2 S_{xx}$. \[\begin{align*} E[S_{YY}] &= E\left[ \sum_{i=1}^n (Y_i - \bar{Y})^2 \right] \\ &= E\left[ \sum_{i=1}^n \left(\beta_0 + \beta_1x_i + \epsilon_i - \beta_0 - \beta_1 \bar{x} - \bar{\epsilon} \right)^2 \right] \\ &= E\left[ \sum_{i=1}^n \left(\beta_1(x_i -\bar{x}) + (\epsilon_i - \bar{\epsilon}) \right)^2 \right] \\ &= E\left[ \sum_{i=1}^n \beta_1^2(x_i -\bar{x})^2 + \sum_{i=1}^n 2 \beta_1 (x_i - \bar{x})(\epsilon_i - \bar{\epsilon}) + \sum_{i=1}^n (\epsilon_i - \bar{\epsilon})^2 \right] \\ &= E\left[ \beta_1^2 S_{xx} + 2 \beta_1 \sum_{i=1}^n (x_i - \bar{x})(\epsilon_i - \bar{\epsilon}) + \sum_{i=1}^n (\epsilon_i - \bar{\epsilon})^2 \right] \\ &= \beta_1^2 S_{xx} + 2 \beta_1 \sum_{i=1}^n (x_i - \bar{x}) E\left[ \epsilon_i - \bar{\epsilon} \right] + \sum_{i=1}^n E\left[ (\epsilon_i - \bar{\epsilon})^2 \right] \,. \end{align*}\] Since $E[\epsilon_i] = E[\bar{\epsilon}] = 0$, the middle term vanishes. As for the right-most term: \[\begin{align*} \sum_{i=1}^n E\left[ (\epsilon_i - \bar{\epsilon})^2 \right] &= \sum_{i=1}^n E\left[ \epsilon_i^2 - 2\epsilon_i \bar{\epsilon} + \bar{\epsilon}^2 \right] \\ &= \sum_{i=1}^n E\left[ \epsilon_i^2 \right] - 2 \sum_{i=1}^n E\left[\epsilon_i \bar{\epsilon}\right] + \sum_{i=1}^n E\left[\bar{\epsilon}^2\right] \\ &= \sum_{i=1}^n E\left[ \epsilon_i^2 \right] - 2 E\left[ \bar{\epsilon} \sum_{i=1}^n \epsilon_i \right] + E\left[ \sum_{i=1}^n \bar{\epsilon}^2\right] \\ &= \sum_{i=1}^n V[\epsilon_i] + \sum_{i=1}^n \left(E[\epsilon_i]\right)^2 - 2 n E\left[ \bar{\epsilon}^2 \right] + n E\left[ \bar{\epsilon}^2\right] \\ &= \sum_{i=1}^n \sigma^2 - n E\left[ \bar{\epsilon}^2 \right] \\ &= n \sigma^2 - n \frac{\sigma^2}{n} = (n-1)\sigma^2 \,. \end{align*}\] So at this point, we have that \[ E[S_{YY}] = S_{xx} \beta_1^2 + (n-1)\sigma^2 \,. \] Now let’s look at $E[\hat{\beta}_1^2 S_{xx}]$: \[\begin{align*} E\left[\hat{\beta}_1^2 S_{xx}\right] &= S_{xx} E\left[\hat{\beta}_1^2\right] \\ &= S_{xx} \left( V\left[ \hat{\beta}_1 \right] + \left( E\left[ \hat{\beta}_1 \right] \right)^2 \right) \\ &= S_{xx} \left( \frac{\sigma^2}{S_{xx}} + \beta_1^2 \right) \\ &= \sigma^2 + S_{xx} \beta_1^2 \,. \end{align*}\] So \[ E[SSE] = S_{xx} \beta_1^2 + (n-1)\sigma^2 - \sigma^2 - S_{xx} \beta_1^2 = (n-2)\sigma^2 \,. \] Thus $SSE/(n-2)$ is an unbiased estimator of $\sigma^2$.

2.17.3 Linear Regression in R

Below, we show the results of learning a linear regression model using R. The data are the same as used to produce Figure 2.25, with the parameter estimates output by lm() mapping directly to the solid red line in that figure.

# lm(): "linear model"
# y~x : this is a "model formula" that in words translates as
#       "regress the response data y upon the predictor data x"
#       or "estimate beta_0 and beta_1 for the model E[Y] = beta_0 + beta_1 x"
lm.out = lm(y~x)

# show a summary of the model
summary(lm.out)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00437 -0.53068  0.04523  0.40338  2.47660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5231     0.3216  14.063  < 2e-16 ***
## x             0.4605     0.0586   7.859 1.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9792 on 38 degrees of freedom
## Multiple R-squared:  0.6191, Adjusted R-squared:  0.609 
## F-statistic: 61.76 on 1 and 38 DF,  p-value: 1.749e-09

The model residuals $Y_i - \hat{Y}_i$ are summarized via five numbers (minimum, maximum, and median, along with the 25$^{\rm th}$ and 75$^{\rm th}$ percentile values). If, e.g., the median is substantially different from zero, it is possible that the assumption that $Y \vert x$ is normally distributed is violated…and it is possible that the assumption $E[\epsilon_i] = 0$ is violated as well.

Under Coefficients, there are four numerical columns. The first shows $\hat{\beta}_0$ ((Intercept)) and $\hat{\beta}_1$ (x), the second shows the estimated standard errors for these statistics, the third is simply the ratio of the values in the first and second columns, and the fourth is the $p$-value. The third and fourth columns are distribution-specific: the hypothesis test being carried out has a null of zero, and the t value is assumed to be $t$-distributed for $n-2$ degrees of freedom. The $p$-values in the fourth column are $\ll \alpha = 0.05$, which lead us to decide that both the intercept and the slope are truly non-zero. (In our simulation, they are: $\beta_0 = 4$ and $\beta_1 = 0.5$.)

The Residual standard error shows the value of $\hat{\sigma} = \sqrt{SSE/(n-2)}$, while the “38 degrees of freedom” indicates that $n = 40$. We note that in this simulation, $\sigma^2 = 1$.

The meanings behind the R-squared values are explained above in the correlation example. In words, approximately 60% of the variation exhibited by the data is accounted for with the linear model.

Last, the line with the phrase F-statistic shows the result of a hypothesis test in which (in general) the null is that all the $\beta_i$’s (excluding $\beta_0$) are zero and the alternative is that at least one of the $\beta_i$’s is non-zero. Here, since there is only one non-intercept coefficient, the $p$-value is equivalent to the one printed under Coefficients. We will discuss where the F-statistic and DF numbers come from below in the next section.

2.18 One-Way Analysis of Variance

In the simple linear regression setting, our data consists of a continuously distributed predictor variable $\mathbf{x}$ and response variable $\mathbf{Y}$. What if, instead, the predictor variable values come in groups? For instance, we might wish to determine the time a person needs to accomplish a task after eating either chocolate or vanilla ice cream. To do this, we might map the eating of chocolate to a particular value of $x$, say $x = 0$, and the eating of vanilla to another value of $x$, say $x = 1$. Then, if we continue to adopt a simple linear regression framework, we have that \[\begin{align*} Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\ E[Y_i \vert x_i = 0] &= \beta_0 \\ E[Y_i \vert x_i = 1] &= \beta_0 + \beta_1 \,. \end{align*}\] To see if there is a difference in task-accomplishment time between groups, we would test the hypothesis $H_o: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$. As the value of the slope is meaningless, beyond whether or not it is zero (because the mapping from group characteristics to values of $x$ is arbitrary), we would not run a conventional linear regression analysis; rather, if we assume that $\bar{Y} \vert x_i$ is normally distributed, we would run a two-sample $t$ test like the Welch’s $t$ test described earlier in this chapter to determine whether $\bar{Y} \vert x_i=0$ is drawn from a distribution with the same population mean as $\bar{Y} \vert x_i=1$.

What if there are more than two groups (or treatments)? For instance, what if we redo the experiment but add strawberry ice cream? When the number of groups is greater than two, we move from the two-sample $t$ test to one-way analysis of variance (or one-way ANOVA; the “one-way” indicates that there is a single predictor variable, which in our example is ice cream flavor). See Figure 2.26.

$\label{fig:anova}Illustration of the setting for a one-way analysis of variance. To the left, the spread of the data within each group is large compared to the differences in means between each group, so an ANOVA is less likely to reject the null hypothesis that $\mu_0 = \mu_1 = \mu_2$. To the right, the spread of data within groups is small relative to the differences in means between groups, so an ANOVA is more likely to reject the null hypothesis.$ $\label{fig:anova}Illustration of the setting for a one-way analysis of variance. To the left, the spread of the data within each group is large compared to the differences in means between each group, so an ANOVA is less likely to reject the null hypothesis that $\mu_0 = \mu_1 = \mu_2$. To the right, the spread of data within groups is small relative to the differences in means between groups, so an ANOVA is more likely to reject the null hypothesis.$

Figure 2.26: Illustration of the setting for a one-way analysis of variance. To the left, the spread of the data within each group is large compared to the differences in means between each group, so an ANOVA is less likely to reject the null hypothesis that $\mu_0 = \mu_1 = \mu_2$. To the right, the spread of data within groups is small relative to the differences in means between groups, so an ANOVA is more likely to reject the null hypothesis.

Let the index $i$ denote a particular group (chocolate ice-cream eaters, etc.) and let the index $j$ denote different data within the same group. Let $n_i$ denote the sample size within each group, and let $k$ denote the total number of groups. The total sum of squares is defined as the sum of squared differences between $Y_i$ and $\bar{Y}$, which in this setting can be expressed via a double summation, one over data within a group, and another over groups: \[ \mbox{Total SS} = \sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij}-\bar{Y}_{i\bullet})^2 + \sum_{i=1}^k n_i(\bar{Y}_{i\bullet}-\bar{Y})^2 = SSE + SST \,, \] where $\bar{Y}_{i\bullet}$ is the average value for group $i$, $SSE$ has its usual meaning as the sum of squared errors (although here $\hat{\beta}_0+\hat{\beta}_1x_i$ is replaced with the estimated group mean $\bar{Y}_{i\bullet}$), and $SST$ is the sum of squared average treatment effects. If the $SSE$ value is large relative to the $SST$ value, then we are in a situation where the data are spread widely within each group relative to the spread in values of the group means. In such a situation, it would be difficult to detect a statistically significant difference between the group means. (See the left panel of Figure 2.26.) On the other hand, if the $SST$ value is large relative to the $SSE$ value, then the group means have a wide spread compared to the data in each group. In such a situation, it would be much easier to detect a statistically significant difference between the group means. (See the right panel of Figure 2.26.) This reasoning motivates using the ratio \[ \frac{SST}{SSE} \] as the hypothesis test statistic. However, we do not know its distribution…or at least, not yet. We do know is that under the null, \[\begin{align*} W_T = \frac{SST}{\sigma^2} &\sim \chi_{k-1}^2 \\ W_E = \frac{SSE}{\sigma^2} &\sim \chi_{n-k}^2 \,. \end{align*}\] And we know that \[ F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F_{\nu_1,\nu_2} \,, \] i.e., the ratio of chi-square distributed random variables, each divided by their respective number of degrees of freedom, is $F$-distributed for $\nu_1$ numerator and $\nu_2$ denominator degrees of freedom. Thus \[ F = \frac{SST/(k-1)}{SSE/(n-k)} = \frac{MST}{MSE} \sim F_{k-1,n-k} \,. \] In the ANOVA hypothesis test, the null hypothesis $H_o$ is that $E[Y_1] = E[Y_2] = \cdots = E[Y_k]$, and the alternative hypothesis is that at least one mean is different from the others. If we test at level $\alpha$, we reject the null hypothesis if $F > F_{1-\alpha,k-1,n-k}$. In R, the boundary of the rejection region would be given by qf(1-alpha,k-1,n-k) while the $p$-value would be given by 1-pf(F.obs,k-1,n-k).

2.18.1 One-Way ANOVA: the Statistical Model

In the notes above, we build up the test statistic for the one-way ANOVA, but we do not discuss the underlying statistical model. We can write down the model as \[ Y_{ij} = \mu + \tau_i + \epsilon_{ij} \,, \] where $i$ denotes the treatment group and $j$ denotes an observed datum within group $i$. $\mu$ is the overall mean response, while $\tau_i$ is the deterministic effect of treatment in group $i$. ($\tau$ is the Greek letter tau, which is pronounced “tao” [rhyming with “ow,” an expression of pain]. Recall that by “deterministic” we mean that $\tau_i$ is not random.) The error terms $\epsilon_{ij}$ are independent, normally distributed, and homoscedastic with variance $\sigma^2$.

Recall that $\bar{Y}_{i\bullet}$ is the sample mean within group $i$. We know it is distributed normally (by assumption) but what are its expected value and variance?

The expected value is \[\begin{align*} E[\bar{Y}_{i\bullet}] &= E\left[\frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}\right] = \frac{1}{n_i}\sum_{j=1}^{n_i} E[Y_{ij}] \\ &= \frac{1}{n_i}\sum_{j=1}^{n_i} E[\mu + \tau_i + \epsilon_{ij}] = \mu + \tau_i + E[\epsilon_{ij}] = \mu+\tau_i \,, \end{align*}\] while the variance is \[ V[\bar{Y}_{i\bullet}] = V\left[\frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}\right] = \frac{1}{n_i}\sum_{j=1}^{n_i^2} V[Y_{ij}] = \frac{1}{n_i^2}\sum_{j=1}^{n_i} V[\mu + \tau_i + \epsilon_{ij}] = \frac{n_i V[\epsilon_{ij}]}{n_i^2} = \frac{\sigma^2}{n_i} \,. \] Thus $\bar{Y}_{i\bullet} \sim \mathcal{N}(\mu+\tau_i,\sigma^2/n_i)$.

2.18.2 Linear Regression in R: Redux

The last example of the last section showed the output from lm(), R’s linear model function:

summary(lm.out)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00437 -0.53068  0.04523  0.40338  2.47660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5231     0.3216  14.063  < 2e-16 ***
## x             0.4605     0.0586   7.859 1.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9792 on 38 degrees of freedom
## Multiple R-squared:  0.6191, Adjusted R-squared:  0.609 
## F-statistic: 61.76 on 1 and 38 DF,  p-value: 1.749e-09

In that example, we indicated that the numbers on the line beginning F-statistic would make more sense after we covered one-way analysis of variance.

There are two coefficients in this model, which is analogous to two “treatment groups.” Hence $k = 2$, $SST/\sigma^2 \sim \chi_{k-1=1}^2$, and $SSE/\sigma^2 \sim \chi_{n-k=38}^2$. This explains on 1 and 38 DF. The $F$ statistic is $MST/MSE = 61.76$, which under the null is sampled from an $F$ distribution with 1 numerator and 38 denominator degrees of freedom. This is an extreme value under the null, as indicated by the $p$-value 1.749 $\times 10^{-9}$…so we conclude that the true value of the slope $\beta_1$ is not zero.

But there’s more. To access more information about the $F$ test that is carried out, we can call the anova() function.

anova(lm.out)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 59.208  59.208  61.756 1.749e-09 ***
## Residuals 38 36.432   0.959                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This function call outputs information we could not see above: the $SST$ (the first row of Sum Sq), $SSE$ (second row), $MST$ (the first row of Mean Sq), and $MSE$ (second row). The F value is the ratio $MST/MSE$; here, that is 59.208/0.959 = 61.756.

2.19 The Exponential Family of Distributions

We conclude this chapter by placing the normal distribution into a larger context: the exponential family of distributions, which encompasses the normal as well as, e.g., the binomial, beta, Poisson, and gamma distributions, the ones that serve as the foci for chapters 3 and 4. Knowing about the exponential family and its properties will help us better understand concepts of statistical inference that we introduce in those chapters, such as the concept of sufficiency.

Let’s assume that we are examining a distribution $P$ that has one freely varying parameter $\theta$. $P$ is a member of the exponential family of distributions if we can express, e.g., its probability density function as \[\begin{align*} f_X(x \vert \theta) = h(x) \exp\left[\eta(\theta) T(x) - A(\theta)\right] \,. \end{align*}\] (This same result holds for a mass function $p_X(x \vert \theta)$ as well. Note that $\theta$ cannot be a domain-specifying parameter, so a distribution whose domain is, e.g., $x \in [0,\theta]$ cannot belong to the exponential family.)

Let’s see how this works with a normal distribution of known variance: \[\begin{align*} f_X(x \vert \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right] &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) \exp\left(\frac{x\mu}{\sigma^2}\right) \exp\left(-\frac{\mu^2}{2\sigma^2}\right) \,. \end{align*}\] From this, we can directly read off the functions that together comprise the exponential family form: \[\begin{align*} h(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) \\ \eta(\mu) &= \frac{\mu}{\sigma^2} \\ T(x) &= x \\ A(\mu) &= \frac{\mu^2}{2\sigma^2} \,. \end{align*}\] We note that we can also define $\eta(\mu) = \mu$ and $T(x) = x/\sigma^2$. Ultimately, the choice of where to place any constants (either within the $\eta(\cdot)$ function or within $T(\cdot)$ function) makes no difference when we are performing inferential tasks.

2.19.1 Exponential Family with a Vector of Parameters

The more general functional form of the exponential family that we use when a distribution has $p \geq 2$ parameters is \[\begin{align*} f_X(x \vert \boldsymbol{\theta}) = h(x) \exp\left[\left(\sum_{i=1}^p \eta_i(\boldsymbol{\theta}) T_i(x)\right) - A(\boldsymbol{\theta})\right] \,. \end{align*}\] For the normal distribution with unknown mean and variance, we can examine the expression \[\begin{align*} f_X(x \vert \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) \exp\left(\frac{x\mu}{\sigma^2}\right) \exp\left(-\frac{\mu^2}{2\sigma^2}\right) \,, \end{align*}\] and, after noting that \[\begin{align*} \frac{1}{\sqrt{2\pi\sigma^2}} = \exp\left(\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)\right) = \exp\left(-\frac12 \log\left(2 \pi \sigma^2\right) \right) \,, \end{align*}\] we can read off the constituent functions: \[\begin{align*} h(x) &= 1 \\ \eta_1(\mu,\sigma^2) &= -\frac{1}{2\sigma^2} \\ T_1(x) &= x^2 \\ \eta_2(\mu,\sigma^2) &= \frac{\mu}{\sigma^2} \\ T_2(x) &= x \\ A(\mu,\sigma^2) &= \frac{\mu^2}{2\sigma^2} + \frac12 \log\left(2 \pi \sigma^2\right) \,. \end{align*}\]

2.20 Exercises

We are given that $X \sim \mathcal{N}(4,4)$. (a) What is $P(X > 2 \vert X > 0)$, in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$? (b) Determine the value $c$ such that $P(X < c \vert X < 4) = 0.5$; leave your answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
What is $E[e^{-Z^2/2}]$ if $Z \sim \mathcal{N}(0,1)$?
A filtering machine only passes through widgets that have masses between 4 and 6. Assume that the mass of a widget is a random variable $M \sim \mathcal{N}(5,4)$. (a) What is the probability that the filtering machine will reject any given widget? Leave your answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$, both here and in part (b). (b) Now assume that $\sigma$ is unknown. What is $\sigma$ such that the probability of rejection is 0.02?
We are given that $X_1$ and $X_2$ are iid draws from a normal distribution with mean $\mu$ and variance $\sigma^2$. What is the value of $c$ such that the probability of $X_1 < \mu - c$ and $X_2 > \mu$ is 1/4? Leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
The moment-generating function for the Laplace distribution is \[\begin{align*} m(t) = \frac{\exp(a t)}{1 - b^2 t^2} \,. \end{align*}\] Derive $E[X]$ for this distribution.
We are given the probability mass function shown below. (a) Derive the moment-generating function for this distribution. (b) Use the derived mgf to compute $E[X]$.

$x$	0	1	2
$p_X(x)$	0.2	0.4	0.4

We are given the following probability density function: \[\begin{align*} f_X(x) = ke^{-x} \,, \end{align*}\] for $x \in [0,a]$. $k$ is a constant that need not be evaluated. (a) Write down the moment-generating function associated with this pdf. (b) Determine $E[X]$ for this pdf.
We are given the two probability mass functions shown below. (a) Write down the moment-generating functions for both $X_1$ and $X_2$. (b) Write down the moment-generating function for the random variable $Y = X_1 + 2X_2$.

$x_1$	$p_{X_1}(x_1)$	$x_2$	$p_{X_2}(x_2)$
0	0.4	0	0.2
1	0.6	1	0.8

The Rademacher distribution is \[\begin{align*} p_X(x) = \left\{ \begin{array}{cc} \frac12 & x = -1 \\ \frac12 & x = 1 \end{array} \right. \,. \end{align*}\] (a) Derive the moment-generating function for a Rademacher random variable. (b) Assume that we have drawn $n$ iid data from the Rademacher distribution. Write down the moment-generating function for the sample mean, $\bar{X} = (\sum_{i=1}^n X_i)/n$.
We are given the following probability density function: \[\begin{align*} f_X(x) = \left\{ \begin{array}{cc} 1/3 & x \in [0,1/2) \\ 2/3 & x \in [1/2,1] \end{array} \right. \,. \end{align*}\] Derive the moment-generating function for $X$.
We sample $n = 16$ iid data from a normal distribution with mean 5. The sample variance is 9. (a) Compute $P(3 \leq \bar{X} \leq 6)$. Leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$. (b) We now sample 49 iid data from an exponential distribution with mean 5 and variance 25. Use the Central Limit Theorem to compute an approximate value for $P(3 \leq \bar{X} \leq 6)$; leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
We sample two iid data, $X_1$ and $X_2$, from a normal distribution with mean $\mu = 1$ and variance $\sigma^2 = 1$. (a) Derive the distribution of $X_1 - X_2$; provide both a name (or appropriate symbol) and the parameter value(s) for the distribution. (b) Derive $P(X_1 \geq X_2)$. (c) Now assume $\sigma^2$ is unknown. For what value of $\sigma$ will the following hold: $P(X_1 \leq 0 \cup X_1 \geq 2) = 0.2$? Utilize standardization and symmetry; the answer should have a single $\Phi$ or $\Phi^{-1}$ symbol in it.
Derive the distribution for $\bar{X}-\mu$, assuming that we sample $n$ iid data from a normal distribution with mean $\mu$ and variance $\sigma^2$. Provide the distribution’s name or symbol and its parameter value(s).
We are given $n=16$ iid data sampled according to a normal distribution with unknown mean and variance. The sample standard deviation is $S = 2$. Write down an expression for the probability that $\bar{X}$ is within 1 of $\mu$, i.e. write down an expression of the form $P(a \leq X \leq b)$, stating clearly what family of distributions $X$ is sampled from as well as the values of the numeric constants $a$ and $b$.
The time to complete a particular task is normally distributed with mean 30 hours. What is the value of $\sigma$ such that 90 percent of workers take at least 25 hours to complete the task? Leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
We are given that \[\begin{align*} f_X(x) = 3x^2 \end{align*}\] for $x \in [0,1]$, and that $U = X^3$. Write down $f_U(u)$.
Let the random variable $X$ be uniformly distributed on the interval [0,1], and let $W = X^2$. What is $P(1/4 \leq W \leq 1/2)$?
We are given that the random variable $X$ is sampled according to the following probability density function: \[\begin{align*} f_X(x) = e^{-x} \,, \end{align*}\] for $x \geq 0$. Assume $U = X^2+4$. Derive the pdf $f_U(u)$.
We are given that the random variable $X$ is sampled according to the probability density function \[\begin{align*} f_X(x) = \frac{1}{\beta} e^{-x/\beta} \end{align*}\] Let $U = -2 X$. (a) What is the pdf for $U$? (b) Is $U$ sampled from an exponential distribution?
We are given that \[\begin{align*} f_X(x) = 1/2 \end{align*}\] for $x \in [-1,1]$, and that $U = X^2-1$. (a) Write down $f_U(u)$. (b) Compute $E[U+1]$.
We are told that the random variable $Y$ is sampled according to the probability density function \[\begin{align*} f_X(x) = 3x^2 \end{align*}\] for $x \in [0,1]$. Let $U = e^{-X}$. Derive $f_U(u)$.
We are given the following probability density function: \[\begin{align*} f_X(x) = \theta x^{\theta-1} \,, \end{align*}\] for $x \in [0,1]$ and $\theta > 0$. (a) Let $U = \exp(X)$. What is $f_U(u)$? (b) Let $U = 2X$. What is $P(U \geq 1)$?
We are to sample $n = 11$ iid data from a normal distribution of unknown mean and variance, and we wish to compute what $P(1 \leq S^2 \leq 2)$ would be if we assume that $\sigma^2 = 2$. Write down an expression of the form $P(a \leq X \leq b)$, where $X$ is a statistic with a known sampling distribution. State the distribution that $X$ is sampled according to, the parameter value(s) for that distribution, and the values of the numeric constants $a$ and $b$.
A client asks us to compute $P(a \leq {\bar X}\sqrt{n}/S \leq b)$, where $a$ and $b$ are constants, and ${\bar X}$, $S$, and $n$ are the sample mean, standard deviation, and size respectively. (The client adds that the individual $X$’s are all iid samples from a normal with mean $\mu$ and variance $\sigma^2$.) Leave appropriate cdf or inverse cdf symbols in the answer, while clearly stating what distribution the cdf corresponds to, and any parameter values for that distribution.
We sample $n$ iid data from a normal distribution with known mean $\mu$. (a) Write an expression for $P(\bar{X}/S \leq a)$ that includes a(n inverse) cdf symbol. (If appropriate, indicate the number of degrees of freedom.) (b) Now assume we don’t know the sample size but we do know the standard deviation $\sigma$. Determine the sample size such that $P(\bar{X}/\sigma \leq a) = b$. The answer should contain a(n inverse) cdf symbol. (If appropriate, indicate the number of degrees of freedom. Note that since we are not looking for a numerical answer, just solve for $n$, and do not indicate that, e.g., we would round the answer up to the nearest integer.)
We are given the following probability density function: \[\begin{align*} f_X(x) = \theta (1-x)^{\theta - 1} \,, \end{align*}\] where $x \in [0,1]$ and $\theta > 0$. (a) Derive the Fisher information $I(\theta)$. (b) Write down the Cramer-Rao Lower Bound on the variance of $\hat{\theta}$, assuming we have sampled $n$ iid data according to the distribution given above.
Let $X_1,\ldots, X_n$ be iid data sampled according to the following probability density function: \[\begin{align*} f_X(x) = \frac{1}{\theta} e^x e^{1/\theta} e^{-e^x/\theta} \,. \end{align*}\] For this pdf, $x \geq 0$ and $\theta > 0$. Also note that $E[e^X] = \theta+1$ and that $V[e^X] = \theta^2$. (a) Find the maximum likelihood estimate for $\theta$. (b) Find the Cramer-Rao Lower Bound for the variance of unbiased estimators for $\theta$. Does the MLE achieve the CRLB?
Find the asymptotic distribution of the maximum likelihood estimate for $p$, given that we sample $n$ iid data according to a Bernoulli distribution: \[\begin{align*} f_X(x) = p^x (1-p)^{1-x} \,, \end{align*}\] where $x \in \{0,1\}$ and $p \in [0,1]$.
We collect $n$ iid data that are distributed according to the following probability density function: \[\begin{eqnarray*} f_X(x) = \frac{a}{x^{a+1}} \,, \end{eqnarray*}\] where $a > 1$ and $x \geq 1$. (a) Derive the Fisher information $I(a)$ associated with a single datum drawn from this distribution. (b) Derive the Cramer-Rao Lower Bound on the variance of unbiased estimators for $a$. (c) In the limit $n \rightarrow \infty$, what is the distribution for $\hat{a}_{MLE}$? Give both the name (or a symbol) for the distribution, and its paramater value(s).
We inspect $n$ = 100 2-by-4’s in a lumber yard. Let $X_i$ be the length of the $i^{\rm th}$ 2-by-4, and assume all the $X_i's$ are independent and identically distributed. The population mean is $\mu = 10$ and the population standard deviation for the length is $\sigma = 5$. (a) Write down an expression for the approximate probability that $\bar{X}$ is more than 1 unit away from $\mu$. Leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$. (b) Let $X_+$ be the sums of the lengths of all 100 2-by-4’s. Specify the family of distributions is $X_+$ approximately sampled from, and evaluate and write down its parameter values.
Workers A and B both perform a task. The time to accomplish this task is a random variable with $\mu_A = 40$ and $\sigma_A = 6$ for worker A, and $\mu_B = 38$ and $\sigma_B = 6$ for worker B. Both workers repeat the task $n = 36$ times. What is the probability that the average time to task completion for worker A is less than that for worker B? Leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
We sample $n = 64$ data according to a probability density function with mean $\mu = 10$ and standard deviation $\sigma = 4$. What is the (approximate) probability that $\bar{X}$ is greater than 9? Leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
We sample $n = 300$ iid data according to a Uniform(0,6) distribution. Derive an approximate expression for $P(2.8 \leq \bar{X} \leq 3.2)$; leave the answer in terms of $\Phi(\cdot)$ and/or $\Phi^{-1}(\cdot)$.
We sample $n = 100$ iid data according to an unknown distribution with mean $\mu = 10$ and variance $\sigma^2 = 4$. Write down the (approximate) probability that the sum of the data is less than $(n+1)\mu$. Leave the answer in terms of an appropriate (inverse) cdf symbol. (Indicate the number of degrees of freedom if applicable.)
We are given 100 iid data sampled according to a distribution with mean $\mu = 10$ and known variance $\sigma^2 = 64$. (a) Estimate $P(\bar{X} \leq \mu-\sigma/10)$. Employ standardization and leave the answer in terms of an appropriate (inverse) cdf function. (b) It turns out we didn’t know $\sigma^2$ after all. The sample variance is $S^2 = 25$. We wish to determine $P(\bar{X} \geq \mu+1)$. Update the calculation in part (a) to take these changes into account.
We are given $n$ iid data sampled from a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. The observed sample mean is $\bar{x}_{\rm obs}$ and the observed sample standard deviation is $s_{\rm obs}^2$. We want to compute a two-sided 90 percent confidence interval for $\sigma^2$, but we cannot do this by hand. (a) When writing out R code, what specific (inverse) cdf function would we use? (b) Assume we have the variables mu, xbar.obs, sigma2, s2.obs, n, and q. What is the first argument for the function you named in part (a)? (Use * for multiplication and / for division.) (c) In the call to uniroot(), should the interval argument be c(-1000,1000), c(-1000,-0.001), or c(0.001,1000)?
We collect $n$ iid data sampled according to a normal distribution with mean $\mu$ and variance $\sigma^2$, and we wish to perform a hypothesis test for the population variance, with the null and alternative hypotheses being $H_o: \sigma^2 = \sigma_o^2 = 10$ and $H_a: \sigma^2 < \sigma_o^2$. The level of the test is $\alpha$. (a) Define an appropriate test statistic $Y$ (under the null), give the name of its sampling distribution, and give the parameter value(s) for that distribution. (b) Write down the boundary of the rejection region as a function of $\alpha$. Use a(n inverse) cdf symbol, and indicate the number of degrees of freedom if appropriate. (c) Derive an expression for the power of the test when $\sigma^2 = \sigma_a^2$. Use a(n inverse) cdf symbol, and indicate the number of degrees of freedom if appropriate. (d) If $\sigma_a^2 > \sigma_o^2$, is the power of the test greater than, less than, or equal to $\alpha$?
We sample a single datum that is distributed according to the following probability density function: \[\begin{eqnarray*} f_X(x) = \frac{a}{x^{a+1}} \,, \end{eqnarray*}\] where $a > 1$ and $x \geq 1$. The expected value of $X$ and the cumulative distribution function for $X$ are \[\begin{eqnarray*} E[X] = \frac{a}{a-1} ~~~\mbox{and}~~~F_X(x \vert a) = 1 - \left(\frac{1}{x}\right)^a \,. \end{eqnarray*}\] We are to carry out the test $H_o : a = a_o = 2$ versus $H_a : a > a_o$, assuming $\alpha = 0.05$. The datum we observe has the value $x_{\rm obs} = 5/4$. (a) Write down an expression for the $p$-value. Evaluate this to a single number, by hand. (b) State a conclusion: “reject” or “fail to reject.” (c) Write down an expression for the rejection-region boundary. (d) Write down an expression that shows the test power for $a = 4$.
We have collected $n$ data pairs $(x_i,Y_i)$ and we have learned a simple linear regression model, with estimated slope $\hat{\beta}_1$. We now transform all of the $x_i$’s; the new predictors are $x_i' = ax_i$, where $a$ is a numerical constant. We relearn the simple linear regression model; its slope is $\hat{\beta}_1'$. (a) Write down the relationship between $\hat{\beta}_1'$ and $\hat{\beta}_1$. (b) Write down the relationship between the variance of $\hat{\beta}_1'$ and the variance of $\hat{\beta}_1$. (c) Write down the relationship between the test statistic $T'$ for $H_o: \beta_1' = 0$ versus $H_a: \beta_1' \neq 0$, and the test statistic $T$ for $H_o: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.
Let $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, for $i=1,...,n$, with the $x_i$’s fixed and with $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$. (a) Show that $\hat\beta_1 = (\sum_{i=1}^n x_i Y_i - n \bar{x}\bar{Y})/(\sum_{i=1}^n x_i^2 - n \bar{x}^2)$ is an unbiased estimator for $\beta_1$. (b) Do we need the assumption that the regression noise is normally distributed for this derivation?
We are given the partially redacted output from R’s lm() and shapiro.test() functions shown below. (a) Does the evidence suggest that the core assumptions of a linear model are violated here, or not violated? (b) What is the estimated slope? (c) What is the sample size $n$? (d) What is the estimate of $\sigma^2$? (e) What is the estimate of the correlation coefficient between $x$ and $Y$?

> summary(lm(Y~x))
... 
Residuals:
    Min      1Q  Median      3Q     Max
-4.4582 -1.5171  0.8756  1.4028  2.5168

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.6813   4.887 3.78e-05 ***
x                        0.1189  -4.624 7.76e-05 ***
... 
Residual standard error: 1.888 on 28 degrees of freedom
         R-squared:  0.4329
... 
> shapiro.test(Y-predict(lm(Y~x)))
... 
W = 0.90706, p-value = 0.01256

Type	\(E[Y]\) Increases with \(\theta\)?	\(p\)-Value	Test Power
two-tail	yes	2 \(\times\) min[\(F_Y(y_{\rm obs} \vert \theta_o)\),	\(F_Y(y_{\rm RR,lo} \vert \theta)\) +
		\(1-F_Y(y_{\rm obs} \vert \theta_o)\)]	\(1 - F_Y(y_{\rm RR,hi} \vert \theta)\)
	no	same as above	same as above

lower-tail	yes	\(F_Y(y_{\rm obs} \vert \theta_o)\)	\(F_Y(y_{\rm RR} \vert \theta)\)
	no	\(1-F_Y(y_{\rm obs} \vert \theta_o)\)	\(1-F_Y(y_{\rm RR} \vert \theta)\)
upper-tail	yes	\(1-F_Y(y_{\rm obs} \vert \theta_o)\)	\(1-F_Y(y_{\rm RR} \vert \theta)\)
	no	\(F_Y(y_{\rm obs} \vert \theta_o)\)	\(F_Y(y_{\rm RR} \vert \theta)\)

Parameter	Estimate	Expected Value	Variance
\(\beta_0\)	\(\bar{Y}-\hat{\beta}_1\bar{x}\)	\(\beta_0\)	\(\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\right)\)
\(\beta_1\)	\(\frac{S_{xY}}{S_{xx}}\)	\(\beta_1\)	\(\frac{\sigma^2}{S_{xx}}\)
\(\sigma^2\)	\(\frac{1}{n-2}\) \(\sum_{i=1}^n\) \(e_i^2\) \(=\frac{SSE}{n-2}\)	\(\sigma^2\)	\(\frac{2\sigma^4}{n-2}\)

\(x_1\)	\(p_{X_1}(x_1)\)	\(x_2\)	\(p_{X_2}(x_2)\)
0	0.4	0	0.2
1	0.6	1	0.8