Suppose we believe that a random variable $Y$ (the response) depends linearly on
random variables $X_1, \dots, X_p$, where $p$ is some positive integer.
We can model this by assuming that
\[
Y =\underbrace{ \beta_0 + \sum_{j=1}^p \beta_j X_j}_{\hat Y} + \epsilon,
\]
where $\beta = (\beta_0, \dots, \beta_p)$ is the parameter vector of the model and $\epsilon$ is
a random variable modeling the residual error between the true value of $Y$ and the linear prediction
$\hat Y$. We assume that $\epsilon$ has mean $0$.
Letting $\mathbf x$ denote the random vector $(1,X_1, \dots, X_p)$, this model can be rewritten
\[
Y = \mathbf x \cdot \beta + \epsilon,
\quad \text{or} \quad
y(\mathbf x ) = \mathbf x \cdot \beta + \epsilon,
\]
where $\cdot$ denotes the standard inner product on $\mathbb R^{p+1}$.
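For example, when $p = 1$ this recovers simple linear regression: with $\mathbf x = (1, X_1)$ the model reads
\[
Y = \beta_0 + \beta_1 X_1 + \epsilon,
\]
a line with intercept $\beta_0$ and slope $\beta_1$, observed up to the noise $\epsilon$.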
One often assumes, as we do now, that $\epsilon$ has a Gaussian distribution. The probability
density function $p$ for $Y$ then becomes
\[
p(y \mid \mathbf x, \beta) = \mathcal N(y \mid \mu (\mathbf x), \sigma^2 (\mathbf x))
\]
where $\mu(\mathbf x) = \mathbf x \cdot \beta$. We will assume that
the variance $\sigma^2(\mathbf x)$ is a constant $\sigma^2$.
Suppose we make $N$ observations: for each $i = 1, \dots, N$ we observe $Y = y^{(i)}$ when
$X_j = x^{(i)}_j$ for $j=1, \dots, p$. Let $\mathbf x^{(i)}$
denote the $i$th feature vector. Further, let $\mathbf y:=(y^{(1)}, \dots, y^{(N)})$
denote the vector of responses. For mathematical convenience we assume that
each feature vector $\mathbf x^{(i)}$ is a $(p+1)$-vector with first entry equal to $1$.
Finally, let $\mathbf X$ denote the $N \times (p+1)$ feature matrix whose $i$th row
is just the feature vector $\mathbf x^{(i)}$.
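As a concrete (and purely illustrative) aside, the following NumPy sketch builds such a feature matrix; the synthetic data and the variable names are placeholders, not part of the derivation.
\begin{verbatim}
import numpy as np

# Synthetic stand-in data: N = 100 observations of p = 3 raw features.
rng = np.random.default_rng(0)
N, p = 100, 3
raw_X = rng.normal(size=(N, p))   # row i holds (x^(i)_1, ..., x^(i)_p)

# Prepend the constant feature 1 to every row, giving the
# N x (p+1) matrix X whose i-th row is x^(i) = (1, x^(i)_1, ..., x^(i)_p).
X = np.column_stack([np.ones(N), raw_X])
assert X.shape == (N, p + 1)
\end{verbatim}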
Given the data above, how can we find $\beta$ such that $\hat Y$ best fits the data, in some sense?
One standard approach is to minimize the residual sum of squares
\[
\text{RSS}(\beta) := \| \mathbf y - \mathbf X \beta \|^2
=(\mathbf y - \mathbf X \beta ) \cdot ( \mathbf y - \mathbf X \beta )
= \sum_{i=1}^N ( y^{(i)} - \mathbf x^{(i)} \cdot \beta)^2.
\]
Here $\mathbf X \beta$ denotes the usual matrix-vector product. We can find
all minima in the usual analytic way, by computing the partial derivatives of $\text{RSS}(\beta)$
with respect to the variables $\beta_j$ for $j=0, \dots, p$. Letting $\mathbf X_j$ denote
the $j$th column of $\mathbf X$, we see that
\begin{align*}
\frac{\partial \text{RSS}}{\partial \beta_j} & = -2 \sum_{i=1}^N ( y^{(i)} - \mathbf x^{(i)} \cdot \beta) x^{(i)}_j \\
&= -2 (\mathbf y - \mathbf X \beta) \cdot \mathbf X_j,
\end{align*}
and so, setting every partial derivative to zero and collecting the resulting $p+1$ equations in matrix form, $\text{RSS}(\beta)$ is minimized precisely when the normal equations
\[
\mathbf X^T ( \mathbf y - \mathbf X \beta) = 0
\]
hold; since $\text{RSS}$ is convex in $\beta$, any such critical point is a global minimum.
If $\mathbf X^T \mathbf X$ is nonsingular, the unique solution is
\[
\hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y.
\]

Assuming that the observations are independent and identically distributed,
the log-likelihood function for the given data $ \mathcal D := \{ \mathbf y, \mathbf X \}$
is
\begin{align*}
\ell (\beta) &:= \ln p (\mathcal D \mid \beta) \\
&= \sum_{i=1}^N \ln p(y^{(i)} \mid \mathbf x^{(i)} , \beta) \\
&= \sum_{i=1}^{N} \ln \left[ \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{1}{2 \sigma^2} (y^{(i)} - \mathbf x^{(i)} \cdot \beta)^2 \right ) \right] .
\end{align*}
Expanding the logarithm gives
\[
\ell(\beta) = -\frac{N}{2} \ln (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \text{RSS}(\beta),
\]
so maximizing $\ell$ is the same as minimizing $\text{RSS}$. In particular, $\hat \beta$ is a maximum likelihood estimate for the data, i.e., $\hat \beta$ satisfies
\[
\hat \beta = \arg \max_{\beta} \ell (\beta).
\]
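One can also verify this equivalence numerically; the following sketch (again on synthetic data, and assuming SciPy is available) maximizes $\ell$ by minimizing $-\ell$ and compares the result with the closed-form $\hat \beta$.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# Synthetic data from a known linear model.
rng = np.random.default_rng(0)
N, p = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
sigma2 = 1.0

def neg_log_likelihood(beta):
    # -ell(beta), dropping the additive constant (N/2) ln(2 pi sigma^2),
    # which does not affect the location of the maximum.
    return np.sum((y - X @ beta) ** 2) / (2 * sigma2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p + 1)).x
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_mle, beta_ols, atol=1e-4)
\end{verbatim}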
We can of course let the data $\mathcal D$ vary and consider
the maximum likelihood estimator $\hat \beta( \mathcal D)$, which is itself a random variable.
The variance-covariance matrix of this estimator is
\[
\text{Var}(\hat \beta) = (\mathbf X^T \mathbf X)^{-1} \sigma^2.
\]
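To see this, substitute $\mathbf y = \mathbf X \beta + \epsilon$ into the formula for $\hat \beta$, where now $\epsilon$ denotes the vector of errors, with $\text{Var}(\epsilon) = \sigma^2 I$ by the independence assumption. Then
\[
\hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T (\mathbf X \beta + \epsilon) = \beta + (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \epsilon,
\]
which shows in passing that $E(\hat \beta) = \beta$, and therefore
\[
\text{Var}(\hat \beta) = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \, \text{Var}(\epsilon) \, \mathbf X \, (\mathbf X^T \mathbf X)^{-1} = (\mathbf X^T \mathbf X)^{-1} \sigma^2.
\]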
We can estimate $\sigma^2$ by
\[
\hat \sigma^2 = \frac{1}{N-p-1} \sum_{i=1}^N (y^{(i)} - \hat y^{(i)})^2,
\]
where $ \hat y^{(i)} := \mathbf x^{(i)} \cdot \hat \beta$ is the response prediction of the best fit linear model.
This is an unbiased estimator for $\sigma^2$; that is, $E (\hat \sigma^2) = \sigma^2$.
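As a final illustrative sketch on synthetic data, $\hat \sigma^2$ and the resulting standard errors of the individual coefficients can be computed as follows.
\begin{verbatim}
import numpy as np

# Synthetic data from a known linear model with noise sigma = 0.3.
rng = np.random.default_rng(0)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                      # fitted responses

# Unbiased estimate of sigma^2: RSS divided by N - p - 1.
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)

# Plugging sigma2_hat into Var(beta_hat) = (X^T X)^{-1} sigma^2 gives
# estimated standard errors (square roots of the diagonal entries).
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
std_errs = np.sqrt(np.diag(cov_beta))
print(sigma2_hat, std_errs)
\end{verbatim}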