Friday, October 25, 2013

Flickr Tags are Useless

Recently I began to code up an image classifier, and my hope was to use Flickr images as a source of real data.  The Flickr API allows you to easily obtain the most recent 100 images that have a given tag.  However, I found the correspondence between tags and content to be marginal.  About 75% of the images I downloaded tagged with "chicken" were pictures of a ferret (presumably named Chicken?) and more than half of the images tagged with "kitten" were of puppies.  Images tagged with "sushi" included dresses, kids playing the violin, ducks, and an assortment of other objects with absolutely no relation to sushi.

 Tagged with "sushi"

Tuesday, October 15, 2013

Sample Covariance Matrix

Suppose $X_1, \dots, X_p$ are random variables and $X:=(X_1, \dots, X_p)$ is the random vector of said random variables. Given a sample $\mathbf x_1, \dots, \mathbf x_N$ of $X$, how do we obtain an estimate for the covariance matrix $\text{Cov}(X)$?

To this end, let $\mathbf X=( \mathbf X_{ij})$ denote the $N \times p$ data matrix whose $k$th row is $\mathbf x_k$ and let $\bar x_i$ denote the sample mean of $X_i$. That is, $\bar x_i$ is the mean of the $i$th column of $\mathbf X$. Then the sample covariance matrix $Q = (Q_{ij})$ is defined by $Q_{ij} := \frac{1}{N-1} \sum_{k=1}^{N} (\mathbf X_{ki} - \bar x_i)(\mathbf X_{kj} - \bar x_j).$ We can write this a bit more compactly if we introduce yet more notation. Let $\bar{\mathbf x}:=(\bar x_1, \dots, \bar x_p)$ denote the sample mean vector and $M$ the $N \times p$ matrix with all rows equal to $\bar{\mathbf x}$. The matrix $\mathbf C:=\mathbf X - M$ is the data matrix that has been centered about the sample means. Since $(\mathbf C^T \mathbf C)_{ij} = \sum_{k=1}^{N} \mathbf C_{ki} \mathbf C_{kj}$, we quickly see that $Q = \frac{1}{N-1} \mathbf C^T \mathbf C.$
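As a quick sanity check, the definition above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `sample_covariance` and the randomly generated data are my own assumptions, not part of any library. It centers the data matrix about the column means and divides the resulting product by $N-1$.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of an (N x p) data matrix whose rows are observations.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    # Subtract the sample mean of each column (broadcast across rows).
    centered = X - X.mean(axis=0)
    # (centered^T centered) / (N - 1) gives the p x p sample covariance matrix.
    return centered.T @ centered / (n_samples - 1)

# Synthetic data: 50 observations of 3 variables (assumed, for demonstration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Q = sample_covariance(X)
```

The result should agree with NumPy's own `np.cov(X, rowvar=False)`, which also uses the $N-1$ normalization by default.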

Friday, October 11, 2013

US Life Expectancy Data

I took a look at the World Health Organization's data on life expectancy at birth for various countries in the world.  The number given for each year assumes that levels of mortality will remain roughly the same throughout the individual's life, and it is a summary of the mortality patterns across all individuals in a given population.

From the WHO's global data I excised the data relating specifically to the United States.  After plotting the data and fitting a regression line, it's pretty clear that so far US life expectancy has increased roughly linearly, growing from about 69 years in the early 1960s to around 79 years today.

Linear Models

Suppose we think that a random variable $Y$ (the response) is linearly dependent on random variables $X_1, \dots, X_p$, where $p$ is some integer. We can model this by assuming that $Y =\underbrace{ \beta_0 + \sum_{j=1}^p \beta_j X_j}_{\hat Y} + \epsilon,$ where $\beta = (\beta_0, \dots, \beta_p)$ is the parameter vector for said model and $\epsilon$ is a random variable representing the residual error between the true value of $Y$ and the linear prediction $\hat Y$. We assume that $\epsilon$ has mean $0$. Letting $\mathbf x$ denote the random vector $(1,X_1, \dots, X_p)$, this model can be rewritten $Y = \mathbf x \cdot \beta + \epsilon, \quad \text{or} \quad y(\mathbf x ) = \mathbf x \cdot \beta + \epsilon,$ where $- \cdot -$ represents the standard inner product on $\mathbb R^{p+1}$.

One often assumes, as we do now, that $\epsilon$ has a Gaussian distribution. The probability density function $p$ for $Y$ then becomes $p(y \mid \mathbf x, \beta) = \mathcal N(y \mid \mu (\mathbf x), \sigma^2 (\mathbf x))$ where $\mu(\mathbf x) = \mathbf x \cdot \beta$. We will assume that the variance $\sigma^2(\mathbf x)$ is a constant $\sigma^2$.

Suppose we make the $N$ observations that $Y = y^{(i)}$ when $X_j = x^{(i)}_j$ for $j=1, \dots, p$ and $i = 1, \dots, N$. Let $\mathbf x^{(i)}$ denote the $i$th feature vector. Further, let $\mathbf y:=(y^{(1)}, \dots, y^{(N)})$ denote the vector of responses. For mathematical convenience we assume that each feature vector $\mathbf x^{(i)}$ is a $(p+1)$-vector with first entry $x^{(i)}_0$ equal to $1$. Finally, let $\mathbf X$ denote the $N \times (p+1)$ feature matrix whose $i$th row is just the feature vector $\mathbf x^{(i)}$. Given the data above, how can we find $\beta$ such that $\hat Y$ is the best fit, in some sense?

One standard way to do this is to minimize the residual sum of squares $\text{RSS}(\beta) := \| \mathbf y - \mathbf X \beta \|^2 =(\mathbf y - \mathbf X \beta ) \cdot ( \mathbf y - \mathbf X \beta ) = \sum_{i=1}^N ( y^{(i)} - \mathbf x^{(i)} \cdot \beta)^2.$ Here $\mathbf X \beta$ denotes the usual matrix multiplication. We can find all minima in the usual analytical way by computing the partials of $\text{RSS}(\beta)$ with respect to the variables $\beta_j$ for $j=0, \dots, p$. Letting $\mathbf X_j$ denote the column of $\mathbf X$ corresponding to $\beta_j$ (so that $\mathbf X_0$ is the column of ones), we see that \begin{align*} \frac{\partial \text{RSS}}{\partial \beta_j} & = -2 \sum_{i=1}^N ( y^{(i)} - \mathbf x^{(i)} \cdot \beta) x^{(i)}_j \\ &= -2 (\mathbf y - \mathbf X \beta) \cdot \mathbf X_j. \end{align*} Setting all $p+1$ partials to zero at once, and noting that $\text{RSS}(\beta)$ is a convex function of $\beta$, we see that $\text{RSS}(\beta)$ is minimized when $\mathbf X^T ( \mathbf y - \mathbf X \beta) = 0.$ In the case that $\mathbf X^T \mathbf X$ is not singular the unique solution is $\hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y.$
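To make the closing formula concrete, here is a minimal NumPy sketch of the least squares fit. The function name `fit_ols`, the synthetic data, and the chosen "true" coefficients are assumptions made purely for illustration. Rather than forming the inverse of $\mathbf X^T \mathbf X$ explicitly, the sketch solves the linear system $\mathbf X^T \mathbf X \beta = \mathbf X^T \mathbf y$, which gives the same $\hat \beta$ when $\mathbf X^T \mathbf X$ is nonsingular.

```python
import numpy as np

def fit_ols(features, y):
    """Least squares estimate of beta = (beta_0, ..., beta_p).

    Hypothetical helper for illustration; `features` is an (N x p) array
    that does NOT yet include the leading column of ones.
    """
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta_0 acts as the intercept.
    X = np.column_stack([np.ones(len(features)), features])
    # Solve X^T X beta = X^T y (assumes X^T X is nonsingular).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed): y = 1 + 2*x1 - 3*x2 + small Gaussian noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
true_beta = np.array([1.0, 2.0, -3.0])
y = true_beta[0] + features @ true_beta[1:] + 0.1 * rng.normal(size=200)

beta_hat = fit_ols(features, y)
print(beta_hat)  # should be close to true_beta
```

In practice `np.linalg.lstsq` is usually the safer choice, since it also copes with a singular or badly conditioned $\mathbf X^T \mathbf X$; the version above is kept close to the formula in the text.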