Tuesday, October 15, 2013

Sample Covariance Matrix

Suppose $X_1, \dots, X_p$ are random variables and $X:=(X_1, \dots, X_p)$ is the random vector of said random variables. Given a sample $\mathbf x_1, \dots \mathbf x_N$ of $X$, how do we obtain an estimate for the covariance matrix $\text{Cov}(X)$?

To this end, let $\mathbf X=( \mathbf X_{ij})$ denote the $N \times p$ data matrix whose $k$th row is $\mathbf x_k$ and let $\bar x_i$ denote the sample mean of $X_i$. That is, $\bar x_i$ is the mean of the $i$th column of $\mathbf X$. Then the sample covariance matrix $Q = (Q_{ij})$ is defined by \[ Q_{ij} := \frac{1}{N-1} \sum_{k=1}^{N} (\mathbf X_{ki} - \bar x_i)(\mathbf X_{kj} - \bar x_j). \] We can write this a bit more compactly if we introduce yet more notation. Let $\bar{\mathbf x}:=(\bar x_1, \dots, \bar x_p)$ denote the sample mean vector and $M$ the $N \times p$ matrix with all rows equal to $\bar{\mathbf x}$. The matrix $N:=\mathbf X - M$ is the data matrix that has been centered about the sample means, and we quickly see that \[ Q = N^T N. \]