Tuesday, December 10, 2013

TF-IDF for Document Clustering

In this post we discuss a standard way to encode a text document as a vector, using a term frequency-inverse document frequency (tf-idf) score for each word, with the aim of clustering similar documents in a corpus.
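
As a rough illustration of the encoding, here is a minimal sketch in plain Python. It assumes one common weighting, namely term frequency as a word's share of the document and inverse document frequency as $\log(N / \mathrm{df})$, where $N$ is the number of documents and $\mathrm{df}$ is the number of documents containing the word; the post itself may use a different variant. The names `tfidf_vectors` and `cosine`, the whitespace tokenizer, and the toy corpus are all made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one tf-idf vector per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            [(counts[w] / total) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell on monday",
]
vocab, vectors = tfidf_vectors(docs)
```

With this weighting, a word that occurs in every document (here "on") gets weight zero, so it contributes nothing to the similarity between documents. A clustering step could then group documents by a measure such as `cosine(vectors[i], vectors[j])`; in the toy corpus the first two documents come out more similar to each other than either is to the third.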

Thursday, December 5, 2013

The Continuum Hypothesis and Weird Probabilities

I recently heard of an interesting "proof" that $(0,1)$ does not have cardinality $\aleph_1$. This would disprove the Continuum Hypothesis ($\textbf{CH}$), which asserts that every subset of $(0,1)$ is either countable or has the same cardinality as $(0,1)$. More precisely, $\textbf{CH}$ states that if $\omega_0$ denotes the first infinite ordinal (i.e., the set of natural numbers) and $\omega_1$ denotes the first uncountable ordinal, then $| \omega_1| = | 2^{\omega_0}|$. Here $2^{\omega_0}$ is the set of all binary-valued functions $f \colon \omega_0 \to \{ 0, 1\}$, which has the same cardinality as the power set of $\omega_0$ and as $(0,1)$.