Friday, October 11, 2013

Linear Models

Suppose we think that a random variable $Y$ (the response) is linearly dependent on random variables $X_1, \dots, X_p$, where $p$ is some integer. We can model this by assuming that \[ Y =\underbrace{ \beta_0 + \sum_{j=1}^p \beta_j X_j}_{\hat Y} + \epsilon, \] where $\beta = (\beta_0, \dots, \beta_p)$ is the parameter vector for said model and $\epsilon$ is a distribution for the residual error between the true value of $Y$ and the linear prediction $\hat Y$. We assume that $\epsilon$ has mean $0$. Letting $\mathbf x$ denote the random vector $(1,X_1, \dots, X_p)$, this model can be rewritten \[ Y = \mathbf x \cdot \beta + \epsilon, \quad \text{or} \quad y(\mathbf x ) = \mathbf x \cdot \beta + \epsilon, \] where $- \cdot -$ represents the standard inner product on $\mathbb R^{p+1}$. One often assumes, as we do now, that $\epsilon$ has a Gaussian distribution. The probability density function $p$ for $Y$ then becomes \[ p(y \mid \mathbf x, \beta) = \mathcal N(y \mid \mu (\mathbf x), \sigma^2 (\mathbf x)) \] where $\mu(\mathbf x) = \mathbf x \cdot \beta$. We will assume that the variance $\sigma^2(\mathbf x)$ is a constant $\sigma^2$. Suppose we make the $N$ observations that $Y = y^{(i)}$ when $X_j = x^{(i)}_j$ for $j=1, \dots, p$ and $i = 1, \dots, N$. Let $\mathbf x^{(i)}$ denote the $i$th feature vector. Further, let $\mathbf y:=(y^{(1)}, \dots, y^{(N)})$ denote the vector of responses. For mathematical convenience we assume that each feature vector $\mathbf x^{(i)}$ is a $p+1$-vector with first entry equal to $1$. Finally, let $\mathbf X$ denote the $N \times (p+1)$ feature matrix whose $i$th row is just the feature vector $\mathbf x^{(i)}$. Given the data above, how can we find $\beta$ such that $\hat Y$ is best fit, in some sense? One standard way to do this is to minimize the residual sum square error \[ \text{RSS}(\beta) := \| \mathbf y - \mathbf X \beta \|^2 =(\mathbf y - \mathbf X \beta ) \cdot ( \mathbf y - \mathbf X \beta ) = \sum_{i=1}^N ( y^{(i)} - \mathbf x^{(i)} \cdot \beta)^2. \] Here $\mathbf X \beta$ denotes the usual matrix multiplication. We can find all minimums in the usual analytical way by computing the partials of $\text{RSS}(\beta)$ with respect to the variables $\beta_j$ for $j=0, \dots, p$. Letting $\mathbf X_j$ denote the $j$th column of $\mathbf X$, we see that \begin{align*} \frac{\partial \text{RSS}}{\partial \beta_j} & = 2 \sum_{i=1}^N ( y^{(i)} - \beta_j x^{(i)}_j) x^{(i)}_j \\ &= 2 (\mathbf y - \beta_j \mathbf X_j) \cdot \mathbf X_j \end{align*} and so $\text{RSS}(\beta)$ is minimized when \[ \mathbf X^T ( \mathbf y - \mathbf X \beta) = 0. \] In the case that $\mathbf X^T \mathbf X$ is not singular the unique solution is \[ \hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y. \]

Assuming that the observed data is independently and identically distributed, the log-likelihood function for the given data $ \mathcal D := \{ \mathbf y, \mathbf X \}$ is \begin{align*}\label{eq:likelihood} \ell (\beta) &:= \ln p (\mathcal D \mid \beta) \\ &= \sum_{i=1}^N \ln p(y^{(i)} \mid x^{(i)} , \beta) \\ &= \sum_{i=1}^{N} \ln \left[ \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{1}{2 \sigma^2} (y^{(i)} - \mathbf x^{(i)} \cdot \beta)^2 \right ) \right] . \end{align*} It is easy to see that $\hat \beta$ is a maximum likelihood estimate for the data, i.e., $\hat \beta$ satisfies \[ \hat \beta = \arg \max_{\beta} \ell (\beta). \] We can in turn of course suppose that the data $\mathcal D$ varies and consider the maximum likelihood estimator $\hat \beta( \mathcal D)$, which is a random variable itself. The variance-covariance matrix for the maximum likelihood estimator $\hat \beta$ is: \[ \text{Var}(\hat \beta) = (\mathbf X^T \mathbf X)^{-1} \sigma^2. \] We can estimate $\sigma^2$ by \[ \hat \sigma^2 = \frac{1}{N-p-1} \sum_{i=1}^N (y^{(i)} - \hat y^{(i)})^2, \] where $ \hat y^{(i)} := \mathbf x^{(i)} \cdot \hat \beta$ is the response prediction of the best fit linear model. This is an unbiased estimator for $\sigma^2$, which means that the expectation $E (\hat \sigma^2) = \sigma^2$.