Tuesday, December 10, 2013

TF-IDF for Document Clustering

In this post we discuss a standard way to encode a text document as a vector using a term frequency-inverse document frequency (tf-idf) score for each word, with an aim to cluster similar documents in a corpus.

Thursday, December 5, 2013

The Continuum Hypothesis and Weird Probabilities

I recently heard of an interesting "proof" that $(0,1)$ does not have cardinality $\aleph_1$. This would disprove the Continuum Hypothesis ($\textbf{CH}$), which asserts that any subset of $(0,1)$ is either countable or has the same cardinality as $(0,1)$. More precisely, $\textbf{CH}$ states that if $\omega_0$ denotes the first countable ordinal (i.e., the set of natural numbers) and $\omega_1$ denotes the first uncountable ordinal, then $| \omega_1| = | 2^{\omega_0}|$. Here $2^{\omega_0}$ is the set of all binary-valued functions $f \colon \omega_0 \to \{ 0, 1\}$, which has the same cardinality as the power set of $\omega_0$ and as $(0,1)$.

Tuesday, November 12, 2013

Simple Divisibility Tests

In this post we will justify some common divisibility tests that most school children are familiar with. Recall that a number such as $111$ is divisible by $3$ if and only if the sum of the digits is divisible by $3$. Since $1+1+1=3$ is divisible by $3$, we see that $111$ is divisible by three.

More generally, an integer $N$ is divisible by $3$ if and only if the sum of the digits appearing in $N$ is divisible by $3$.

Saturday, November 2, 2013

A Simple Proof that the Harmonic Series Diverges

One usually encounters the harmonic series \[ \sum_{k=1}^{\infty} \frac{1}{k} \] as an example of a series that diverges for non-obvious reasons. With $H_n$ defined to be the partial sum $H_n:=\sum_{k=1}^n \frac{1}{k}$, a typical way to prove divergence is to observe that \begin{align*} H_{2n} &= H_n + \frac{1}{n+1} + \dots + \frac{1}{2n} \\ & \geq H_n + \underbrace{\frac{1}{2n} + \dots + \frac{1}{2n}}_{n \text{ terms}} \\ & = H_n + \frac{1}{2}. \end{align*} In particular, $H_{2^n} \geq 1 + \frac{n}{2}$, so the sequence of partial sums of the harmonic series has an unbounded subsequence, whence the series diverges.

Friday, October 25, 2013

Flickr Tags are Useless

Recently I began to code up an image classifier, and my hope was to use Flickr images as a source of real data.  The Flickr API allows you to easily obtain the most recent 100 images that have a given tag.  However, I found that the correspondence between tags and content to be marginal.  About 75% of the images I downloaded tagged with "chicken" were pictures of a ferret (presumably named Chicken?) and more than half of the images tagged with "kitten" were of puppies.  Images tagged with "sushi" included dresses, kids playing the violin, ducks, and an assortment of other objects with absolutely no relation to sushi.

Tagged with "sushi"

Tuesday, October 15, 2013

Sample Covariance Matrix

Suppose $X_1, \dots, X_p$ are random variables and $X:=(X_1, \dots, X_p)$ is the random vector of said random variables. Given a sample $\mathbf x_1, \dots \mathbf x_N$ of $X$, how do we obtain an estimate for the covariance matrix $\text{Cov}(X)$?

To this end, let $\mathbf X=( \mathbf X_{ij})$ denote the $N \times p$ data matrix whose $k$th row is $\mathbf x_k$ and let $\bar x_i$ denote the sample mean of $X_i$. That is, $\bar x_i$ is the mean of the $i$th column of $\mathbf X$. Then the sample covariance matrix $Q = (Q_{ij})$ is defined by \[ Q_{ij} := \frac{1}{N-1} \sum_{k=1}^{N} (\mathbf X_{ki} - \bar x_i)(\mathbf X_{kj} - \bar x_j). \] We can write this a bit more compactly if we introduce yet more notation. Let $\bar{\mathbf x}:=(\bar x_1, \dots, \bar x_p)$ denote the sample mean vector and $M$ the $N \times p$ matrix with all rows equal to $\bar{\mathbf x}$. The matrix $N:=\mathbf X - M$ is the data matrix that has been centered about the sample means, and we quickly see that \[ Q = N^T N. \]

Friday, October 11, 2013

US Life Expectancy Data

I took a look at the World Health Organization's data on the at birth life expectancies for various countries in the world.  The number given for each year assumes that levels of mortality will remain roughly the same throughout the individual's life, and it is a summary of the mortality patterns across all individuals in a given population.

From the WHO's global data I excised the data relating specifically to the United States.  After plotting the data and fitting a regression line, it's pretty clear that so far the US life expectancy exhibits a linear rate of increase, growing from about 69 years in the early 1960's to around 79 years today.

Linear Models

Suppose we think that a random variable $Y$ (the response) is linearly dependent on random variables $X_1, \dots, X_p$, where $p$ is some integer. We can model this by assuming that \[ Y =\underbrace{ \beta_0 + \sum_{j=1}^p \beta_j X_j}_{\hat Y} + \epsilon, \] where $\beta = (\beta_0, \dots, \beta_p)$ is the parameter vector for said model and $\epsilon$ is a distribution for the residual error between the true value of $Y$ and the linear prediction $\hat Y$. We assume that $\epsilon$ has mean $0$. Letting $\mathbf x$ denote the random vector $(1,X_1, \dots, X_p)$, this model can be rewritten \[ Y = \mathbf x \cdot \beta + \epsilon, \quad \text{or} \quad y(\mathbf x ) = \mathbf x \cdot \beta + \epsilon, \] where $- \cdot -$ represents the standard inner product on $\mathbb R^{p+1}$. One often assumes, as we do now, that $\epsilon$ has a Gaussian distribution. The probability density function $p$ for $Y$ then becomes \[ p(y \mid \mathbf x, \beta) = \mathcal N(y \mid \mu (\mathbf x), \sigma^2 (\mathbf x)) \] where $\mu(\mathbf x) = \mathbf x \cdot \beta$. We will assume that the variance $\sigma^2(\mathbf x)$ is a constant $\sigma^2$. Suppose we make the $N$ observations that $Y = y^{(i)}$ when $X_j = x^{(i)}_j$ for $j=1, \dots, p$ and $i = 1, \dots, N$. Let $\mathbf x^{(i)}$ denote the $i$th feature vector. Further, let $\mathbf y:=(y^{(1)}, \dots, y^{(N)})$ denote the vector of responses. For mathematical convenience we assume that each feature vector $\mathbf x^{(i)}$ is a $p+1$-vector with first entry equal to $1$. Finally, let $\mathbf X$ denote the $N \times (p+1)$ feature matrix whose $i$th row is just the feature vector $\mathbf x^{(i)}$. Given the data above, how can we find $\beta$ such that $\hat Y$ is best fit, in some sense? One standard way to do this is to minimize the residual sum square error \[ \text{RSS}(\beta) := \| \mathbf y - \mathbf X \beta \|^2 =(\mathbf y - \mathbf X \beta ) \cdot ( \mathbf y - \mathbf X \beta ) = \sum_{i=1}^N ( y^{(i)} - \mathbf x^{(i)} \cdot \beta)^2. \] Here $\mathbf X \beta$ denotes the usual matrix multiplication. We can find all minimums in the usual analytical way by computing the partials of $\text{RSS}(\beta)$ with respect to the variables $\beta_j$ for $j=0, \dots, p$. Letting $\mathbf X_j$ denote the $j$th column of $\mathbf X$, we see that \begin{align*} \frac{\partial \text{RSS}}{\partial \beta_j} & = 2 \sum_{i=1}^N ( y^{(i)} - \beta_j x^{(i)}_j) x^{(i)}_j \\ &= 2 (\mathbf y - \beta_j \mathbf X_j) \cdot \mathbf X_j \end{align*} and so $\text{RSS}(\beta)$ is minimized when \[ \mathbf X^T ( \mathbf y - \mathbf X \beta) = 0. \] In the case that $\mathbf X^T \mathbf X$ is not singular the unique solution is \[ \hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y. \]

Thursday, September 19, 2013

The Monty Hall Problem via Bayes' Rule

The Monty Hall Problem is a puzzle presented in the context of a gameshow. Behind one of three doors is a prize, and behind the other two doors are goats. Here we assume that the prize is not a goat. You pick a door at random. The game's host, Monty Hall, knows which door the prize is behind. He opens one of the remaining two doors to reveal a goat and asks you if you'd like to switch your choice to the other unopened door. Should you switch?

Thursday, September 12, 2013

Monday, August 19, 2013

Counting Inversions in Python

Below is an O(n log(n)) algorithm for counting the number of inversions in an array of distinct positive integers. We call the index pair (i,j) of an array A an inversion provided that i < j and A[i] > A[j]. The idea is to break the array A into a left half L:=A[:n] and a right half R:=A[:n] and then count the number of inversions in L, in R, and the number of split inversions (i,j) where i is less than n and j is greater than or equal to n. If L and R are already sorted then we can easily count the number of split inversions via a slight augmentation to the merge subroutine of merge-sort.

def SortCount(A):
   l = len(A)
   if l > 1:
      n = l//2
      C = A[:n]
      D = A[n:]
      C, c = SortCount(A[:n])
      D, d = SortCount(A[n:])
      B, b = MergeCount(C,D)
      return B, b+c+d
      return A, 0

def MergeCount(A,B):
   count = 0
   M = []
   while A and B:
      if A[0] <= B[0]: 
         count += len(A)
   M  += A + B     
   return M, count 

Tuesday, August 6, 2013

A Coin Tossing Game

As an example of the law of total probabilities, I'm going to explore a slightly generalized variant of an interview question I recently encountered. The basic scenario is to imagine two players, let's say Amanda and Bill, both of who have a weighted coin. They take turns tossing their coin and whoever obtains the first heads wins. Amanda's coin comes up heads with probability $\theta$ and Bill's with probability $\psi$. If Amanda tosses first, what's the probability she will win?

Saturday, July 27, 2013

A natural isomorphism between a vector space and its double dual.

Soon after one begins a study of abstract vector spaces, they are confronted with the fact that a finite dimensional vector space is naturally (or canonically) isomorphic to its double dual space. Intuitively, this means that the isomorphism is somehow independent of any particular choice of basis. This notion is made rigorous with the notion of natural transformations of functors between categories.

Monday, June 24, 2013

Densities and Expectations

For the layperson, it's probably most helpful to think of the density function $f$ associated to a random variable $X$ as the function you integrate to compute probabilities. Similarly, the expected value $E(X)$ is thought of as the average value that $X$ takes. Often $E(X)$ is defined already in terms of the density function, and it's not clear to the beginner why $\int_{\mathbb R} x f(x) \ dx$ should compute the expected value of $X$. What should be perhaps a bit more obvious is that if you integrate $X$ over the entire probability space, with respect to the given probability measure, then you obtain the average value of $X$. This is indeed how the expectation of $X$ is typically defined in a more analytical setting.

Sunday, June 9, 2013

Complex Numbers as Matrices

Many people take exception with the complex number field $\mathbb C$ because the equality \begin{equation}\label{eq:neg1} i^2 = -1 \end{equation} rankles them in some way. I suspect that the layperson's distaste for this identity is their wont to interpret $i$ as a real number, since they have the most experience with the arithmetic of real numbers. Surely our convention of calling numbers such as $\sqrt 2$, $\pi$, and $e$ real but $-2i$ ``purely imaginary'' doesn't help.

Identities of the form in \eqref{eq:neg1} abound in the wild. For instance, within the ring $M_2(\mathbb R)$ of $2 \times 2$-matrices with real entries, the rotation by $90^{\circ}$ matrix $R$ defined by \begin{equation}\label{eq:matrix} R := \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \end{equation} satisfies $R^2 = -\mathbf{1}$. Here $\mathbf{1}$ signifies the identity matrix \[ \mathbf{1}:=\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \] which is the multiplicative identity in $M_2(\mathbb R)$.

Saturday, March 30, 2013

Mathematical Intro to Data Science

I just came across Yuan Yao's lecture notes/book titled A Mathematical Introduction to Data Science.  Unfortunately, the last few chapters are presently empty.

Friday, March 29, 2013

Non-linear Additive Maps

While teaching an introductory linear algebra course, a colleague noticed that most of the examples of additive maps he gave turned out to be linear.  He asked whether I could think of a map which was additive, but not linear.  In a general context, the question was to find a ring $R$, $R$-modules $ V$ and $ W$, and a map $ f \colon V \to W$ such that $ f $ is a group homomorphism, but not an $ R$-module morphism, i.e, $ f(x+y)=f(x)+f(y)$ for all $ x,y \in V,$ but there is some $ r \in R$ and $ z \in V$ such that $ f(r \cdot z) \neq r \cdot f(z)$.

Symmetries of the Naturals

This was a little problem I thought about while backpacking a few years ago.

We define a symmetry of the natural numbers to be any bijective function $ f \colon \mathbb{N} \to \mathbb{N}$.  Let $ S$ denote the set of symmetries of $ \mathbb N$.

Claim.  $ S$ has the cardinality of the continuum.

Sums of Consecutive Integers

A fun fact I recently learned about is the following.

Claim . A natural number can be written as the sum of at least two consecutive positive integers if and only if it is not a power of 2.

All Horses Are the Same Color

I was fortunate to TA UCSC's introductory proofs course three times.  One of the exercises I like to pose to my students is the following fallacious induction proof that all horses are the same color.

The "Proof"

As a base case, it's clear that any horse is the same color as itself.  Assume as an inductive hypothesis that any collection of n horses are the same color, where n is some fixed natural number.  Given any collection of n+1 horses, we may select a subset consisting of all but a single horse.  By the inductive hypothesis, all the horses in this subset are the same color.  Now regroup the horses and select another n horse subset, but this time we pick the horse that left out of the first selection.  Again all the horses in this subset are the same color, so all the horses in the original set are the same color.

Code Snippets

Presently, I am using Google's Code-Prettify to include code snippets in my posts.  Getting this up and running was fairly painless.

1.  Navigate to Design>Template>Edit Html.
2.  After the <head> tag insert <script defer="defer" src=""> </script>. 3. Within the <body> insert onload='prettyPrint()'. The tag should now look similar to: <body expr:class='"loading" + data:blog.mobileClass' onload='prettyPrint()'>
4.  Consult the Google pages linked to above to see how to employ code-prettify.

One must be careful to encode the snippet so that it is HTML ready.

Thursday, March 28, 2013


A few days ago I heard about MathJax, "an open source JavaScript display engine for mathematics that works in all modern browsers."  As this is a blog dealing with mathematics, I require that my host supports some form of mathematical markup, preferably LaTeX. allows basic LaTeX, but its engine renders images, which do not scale well on mobile devices.  MathJax renders the markup in CSS and Web fonts, which scale nicely even in a mobile setting.