\documentclass[10pt]{article}
\usepackage{amsfonts,amsthm,amsmath,amssymb}
\usepackage{forest}
\usepackage{array}
\usepackage{epsfig}
\usepackage{fullpage}
\usepackage[colorlinks = false]{hyperref}
\newcommand{\1}{\mathbbm{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\eps}{\varepsilon}
\newcommand{\DTIME}{\textbf{DTIME}}
\renewcommand{\P}{\textbf{P}}
\newcommand{\SPACE}{\textbf{SPACE}}
\newenvironment{interlude}{\noindent{\bf Interlude}\hspace*{1em}}{\bigskip}
\begin{document}
\input{preamble.tex}
\newtheorem{example}[theorem]{Example}
\theoremstyle{definition}
\newtheorem{defn}[theorem]{Definition}
\handout{CS 229r Information Theory in Computer Science}{Feb 7, 2019}{Instructor:
Madhu Sudan}{Scribe: Daniel Chiu}{Lecture 4}
\section{Miscellaneous}
\subsection{Schedule for today}
\begin{itemize}
\item Single Shot Compression
\item Universal Compression
\item Markovian Sources
\end{itemize}
\subsection{Logistics}
Problemset 1 is due at 8pm on Friday 2/8. You can use a total of 3 late days over the semester, but only at most 2 can be used on a single problemset.
\section{Single Shot Compression}
Up to now, we've been thinking of compression in terms of measurements: entropy is a measurement; information is a measurement. Now, we will think of compression as a problem - e.g.\ ``you have a file, you want to compress it [map it to a more transmittable form] right away''.
\begin{defn}[Single Shot Compression]
In the \textit{Single Shot Compression} problem, there are two parties, a sender and a receiver. Both know some distribution $P = (p_1, \cdots, p_m)$ over the possible inputs to the encoding. The sender additionally knows a sample $X \sim P$, and wants to transmit it using the encoder $E$. The encoder $E: [m] \to \{0,1\}^*$ should give rise to a prefix-free encoding, which implies the existence of a decoder $D: \{0,1\}^* \to [m] \cup \{?\}$ (where $?$ denotes an invalid input). The goal is to minimize the expected length, over the distribution $P$, of the encoding:
\begin{equation}
\min_E \{ \mathbb{E}_{X \sim P} [|E(X)|] \}
\end{equation}
\end{defn}
It turns out that there exists an optimal algorithm for Single Shot Compression - giving an encoder that minimizes the expected length of the encoding for any distribution $P$. This is the \textbf{Huffman Encoding}.
How do we bring entropy into this? \textbf{Shannon Encoding} solves the problem using at most $H(X) + 1$ bits in expectation. Note that entropy by definition tells us that we need at least $H(X)$ bits in expectation - this is the \textbf{Shannon Lower Bound} (which we've already seen), so Shannon Encoding is within 1 bit of the [a priori] best possible solution.
\subsection{Huffman Coding}
To understand Huffman coding, we first describe the \textbf{Encoding Tree}.
\begin{example}
Suppose we have the mapping
\begin{align*}
A &\to 0 \\
B &\to 1011 \\
C &\to 100 \\
D &\to 1010 \\
E &\to 11
\end{align*}
We can represent this as a binary tree, where a 0 represents taking a left edge, and a 1 a right edge. If we mark each node which is the terminus of an output of the encoder, the prefix-free condition means that for any marked node, none of its ancestors are also marked.
\begin{center}
\begin{forest}
[``"
[\textrm{0(A)}]
[1
[10
[100(C)]
[101
[1010(D)]
[1011(B)]
]
]
[11(E)]
]
]
\end{forest}
\end{center}
\end{example}
\begin{defn}[Huffman Encoding]
Given $P = (p_1, \cdots, p_m)$, the encoding function $E$ for the \textit{Huffman Encoding} is obtained by the following recursive algorithm:
\begin{enumerate}
\item If $m = 1$, encode $E(1) = ``"$ (the empty string) and return. This is the base case.
\item Sort the $p_i$. For the remaining steps, assume $p_1 \ge \cdots \ge p_m$.
\item Merge $p_{m-1}, p_m$ to get $Q = (q_1, \cdots, q_{m-1})$ where $q_i = p_i$ for $i \le m-2$ and $q_{m-1} = p_{m-1} + p_m$.
\item Let $E'$ be the encoding obtained recursively by encoding $Q$. Encode $1, \cdots, m-2$ as $E'$ does, but let $E(m-1) = E'(m-1) \circ 0, E(m) = E'(m-1) \circ 1$ (where $\circ$ denotes concatenation). Note that this preserves prefix-free-ness.
\end{enumerate}
\end{defn}
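As a sanity check, the merge procedure above can be sketched in Python. This version uses a min-heap instead of re-sorting each round (equivalent, since we only ever need the two least likely symbols), and builds the codewords bottom-up rather than literally recursing; it is an illustrative sketch, not the lecture's pseudocode:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for the distribution (p_1, ..., p_m).

    Returns a dict mapping symbol index -> binary codeword.
    """
    m = len(probs)
    if m == 1:
        return {0: ""}  # base case: a single symbol gets the empty string
    # heap entries: (probability, tiebreak counter, {symbol: partial codeword})
    heap = [(p, i, {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = m
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)  # least likely group
        p1, _, code1 = heapq.heappop(heap)  # second least likely group
        # merging: prepend a distinguishing bit to each side
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]
```

On the dyadic distribution $(1/2, 1/4, 1/8, 1/8)$ this yields codeword lengths $1, 2, 3, 3$ and expected length $1.75 = H(X)$, matching the Shannon lower bound exactly.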
Here's a sketch of the proof of optimality:
\begin{proof}
Suppose $E$ is some optimal encoding of $P = (p_1, \cdots, p_m)$. Without loss of generality, $p_1 \ge \cdots \ge p_m$, and define $\ell_i = |E(i)|$ for $1 \le i \le m$.
If we didn't have $\ell_1 \le \ell_2 \le \cdots \le \ell_m$, then consider any pair $(i, j)$ such that $p_i > p_j$ but $\ell_i > \ell_j$. Swapping the two encodings for $i$ and $j$ reduces the cost by $(p_i - p_j)(\ell_i - \ell_j)$, which is positive, a contradiction. Thus, modulo equal probabilities, we have that the $\ell$'s are nondecreasing, and we can swap encodings for elements with equal probability at no cost to get this in general.
Consider the encoding tree of $E$. Since $\ell_m$ is maximal, the encoding $E(m)$ must be a leaf node $N$ of the encoding tree. Unless $m = 1$, $N$ has a sibling $N'$, and $N'$ must be an encoding as well (why?). Thus, merge $m, m-1$ in the same way as the algorithm above does, and consider the optimal tree for the merged distribution $(p_1, \cdots, p_{m-2}, p_{m-1}+p_m)$. Inductively, that tree must also be optimal, and we are basically done.
\end{proof}
\begin{exercise}
Complete and formalize the above proof.
\end{exercise}
When discussing Huffman codes, typical algorithms classes stop here. However, we'll go further to show that the expected encoding length that Huffman coding achieves is bounded by $H(X) + 1$. Note that this \textit{is} surprisingly good, because this is for Single Shot Compression, whereas entropy is defined based on the limit of encoding more and more copies of the base text.
\subsection{Shannon Encoding}
To do so, we move on to...
\begin{defn}[Shannon Encoding]
\textit{Shannon Encoding} also takes in $P = (p_1, \cdots, p_m)$. We'll say upfront that to encode $i$, we will use $|E(i)| = \ell_i = \lceil \log \frac{1}{p_i} \rceil$ bits. Since $\ell_i \ge \log \frac{1}{p_i}$, we have $2^{-\ell_i} \le p_i$, so $\sum_i 2^{-\ell_i} \le \sum_i p_i = 1$. Thus, Kraft's inequality holds, and an encoding function exists with these lengths $\ell_1, \cdots, \ell_m$.
\end{defn}
\begin{remark}
Note that depending on how you prove Kraft's inequality, this might be entirely nonconstructive. However, it does have the interesting property that given any one $p_i$, we can immediately determine the length of its encoding $E(i)$ without knowing the other probabilities.
\end{remark}
We can immediately analyze the performance (expected encoding length) of Shannon encoding:
\begin{align*}
\mathbb{E}_{X \sim P} [|E(X)|] = \sum_i p_i \ell_i = \sum_i p_i \lceil \log \frac{1}{p_i} \rceil \le \sum_i p_i\left(\log \frac{1}{p_i} + 1\right) = H(X) + 1
\end{align*}
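The calculation above is easy to check numerically. A small Python sketch (the particular distribution is made up for illustration):

```python
import math

def shannon_lengths(probs):
    # Shannon codeword length for symbol i: ceil(log2(1/p_i))
    return [math.ceil(math.log2(1 / p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]                      # illustrative distribution
lengths = shannon_lengths(probs)
kraft = sum(2.0 ** -l for l in lengths)           # Kraft sum: must be <= 1
avg = sum(p * l for p, l in zip(probs, lengths))  # expected encoding length
H = sum(p * math.log2(1 / p) for p in probs)      # entropy H(X)
```

Here `kraft <= 1` certifies that a prefix-free code with these lengths exists, and `avg` lands in the interval $[H(X), H(X)+1]$ as the analysis promises.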
\begin{remark}
Since Huffman encoding is optimal and thus at least as good as Shannon, Huffman (which is harder to analyze) achieves at most $H(X) + 1$ as well. It's quite remarkable that entropy captures optimal encoding length so well.
\end{remark}
\begin{exercise}
Is the gap of 1 between entropy and Shannon/Huffman tight? We know
\begin{equation*}
H(X) \le \text{Huffman length} \le \text{Shannon length} \le H(X) + 1
\end{equation*}
There's a total gap of 1 between $H(X)$ and $H(X) + 1$. Try to find distributions that maximize the gap between each pair of adjacent quantities (one gap at a time - maximize $\text{Huffman length} - H(X)$, and so on).
\end{exercise}
\begin{interlude}
A long time ago, people were trying to build the first fax machine, and thought about compression. To compress, they needed a distribution over the inputs, so they found frequencies of small strings manually. This was the state of the art in fax machines for 20 years.
\end{interlude}
\section{Universal Compression}
Unfortunately, uses of Single Shot Compression are uncommon in the real world. For instance, if you feed gzip a new file, it works regardless of the file's language, and without any prior on the distribution the file is drawn from. This leads to the idea of \textbf{Universal Compression} - compression that works for any distribution and any source of information.
\begin{defn}[Universal Compression]
The \textit{Universal Compression} problem takes an input string $w \in \Sigma^n$ ($\Sigma$ is the alphabet) and compresses $w$ to $\{0,1\}^*$. The result should be invertible and prefix-free. Similar to the single shot version, we can define an expected length of encoding which should be minimized.
\end{defn}
\subsection{Lempel-Ziv}
Lempel and Ziv gave an algorithm that is relatively effective empirically. There are some theorems for certain classes of probabilistic sources of $w$, which we will investigate more later in the semester. Today, we will describe the algorithm and some potential probabilistic sources.
What are we hoping for? We wish to find some repetitive structure; some self-similarity to exploit.
\begin{defn}[Lempel-Ziv]
\textit{Lempel-Ziv} compression begins by splitting the input string into encode-able pieces. Given $w$, we split it into chunks $s_0, s_1, s_2, \cdots, s_m$, where $s_0$ is the empty string and $w$ is the concatenation $s_1 \circ s_2 \circ \ldots \circ s_m$. For all $i \ge 1$, we require $s_i = s_{j_i} \circ b_i$ for some $j_i < i$ and $b_i \in \Sigma$. In other words, each chunk should be a previously seen chunk extended by one extra character. Furthermore, all chunks should be distinct.
\end{defn}
\begin{exercise}
The above uniquely determines the chunking. Why?
\end{exercise}
\begin{example}
\begin{align*}
w &= 0 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 \\
w &= 0|1|0 1|1 1|0 0|1 1 0|1 1 1|1 1 0 1
\end{align*}
\end{example}
\begin{defn}[Lempel-Ziv (continued)]
Finally, the encoding is simply $E(w) = PF((j_1,b_1), \cdots, (j_m, b_m))$, where the $j$'s and $b$'s are encoded in some prefix-free manner (denoted $PF$ above). Each $j_i$ and $b_i$ encodes to at most about $\log n$ bits, so the total encoding is of length approximately $2m\log(n)$.
\end{defn}
\begin{exercise}
Find a prefix-free encoding of $\mathbb{Z}^+$ that encodes $n$ using $\log n + O(\log \log n)$ bits.
\end{exercise}
\begin{example}
Continuing the example above, we have that the pairs $(j, b)$ are
\begin{equation*}
(0, 0), (0, 1), (1, 1), (2, 1), (1, 0), (4, 0), (4, 1), (6, 1)
\end{equation*}
\end{example}
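The chunking rule translates directly into code: scan left to right, extending the current chunk as long as it has been seen before, and emit a pair as soon as it becomes new. A Python sketch (illustrative; a full implementation would also handle a leftover partial chunk at the end of the string, which this example happens not to have):

```python
def lz_parse(w):
    """LZ78-style chunking: each chunk is an earlier chunk plus one symbol.

    Returns the list of pairs (j_i, b_i), where j_i indexes an earlier
    chunk (0 = the empty chunk s_0) and b_i is the extra character.
    """
    index = {"": 0}  # chunk -> its index; s_0 is the empty string
    pairs = []
    cur = ""
    for b in w:
        if cur + b in index:
            cur += b                       # extend the current match
        else:
            pairs.append((index[cur], b))  # new chunk = chunk j_i, plus b_i
            index[cur + b] = len(index)
            cur = ""
    return pairs
```

Running this on the example string $w = 010111001101111101$ reproduces the chunking and the pairs listed above.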
\begin{remark}
Lempel-Ziv can often actually \textit{expand} short strings. It doesn't ``get going" until it builds up enough structure in the beginning of the string.
\end{remark}
\begin{remark}
Can we iterate compression? Generally, no: a good compression scheme's output is approximately uniformly distributed over strings of its length (a consequence of the output length being approximately the entropy), and near-uniform strings cannot be compressed further.
\end{remark}
\subsection{Markovian sources}
Now, we aim to analyze the performance of Lempel-Ziv. To do so, we let the strings be drawn from some distribution $P_X$.
\begin{theorem}
If $W = w_1 \circ \ldots \circ w_n$ where the $w_i$ are sampled i.i.d.\ from $P_X$, then as $n \to \infty$, with high probability the length of the compression is $(H(X) + o(1))n$.
\end{theorem}
Note that another approach to this is Huffman coding - finding the sample frequency of each alphabet character, sending this distribution information so the decoder can be constructed, and then performing Huffman encoding using this distribution. More surprisingly, Lempel-Ziv can effectively compress Markovian sources.
\begin{defn}[(Time-invariant) Markov Chain]
A sequence $Z_1, \cdots, Z_n$ is a \textit{Markov chain} if
\begin{equation*}
\forall n: Z_n | Z_1, \cdots, Z_{n-1} \sim Z_n | Z_{n-1}.
\end{equation*}
By the $\sim$ notation, we mean the conditional distributions are the same. It is additionally \textit{time-invariant} if
\begin{equation*}
\forall n, m: Z_n | Z_{n-1} \sim Z_m | Z_{m-1}.
\end{equation*}
\end{defn}
One piece of terminology - $Z_i$ is called the ``state" at time $i$. Furthermore, we can classify Markov chains where each state comes from a finite set:
\begin{defn}[$k$-state Markov chain]
Suppose that for all $i$, $Z_i \in \Gamma = \{1, \cdots, k\}$.
Then, a $k$\textit{-state Markov chain} is given by a $k \times k$ matrix $M$ where $M_{ij} = \Pr[Z_2 = j | Z_1 = i]$. In essence, this is a probabilistic finite automaton.
\end{defn}
We will only consider $k$-state Markov chains that are
\begin{enumerate}
\item Irreducible (strongly connected): there's a path from every state to every other state.
\item Aperiodic: the greatest common divisor of all cycle lengths is $1$.
\end{enumerate}
These conditions imply the existence of a unique stationary distribution $\Pi$, i.e., a distribution such that $Z_i \sim \Pi \implies Z_{i+1} \sim \Pi$.
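For intuition, the stationary distribution can be approximated by power iteration: start from any distribution and repeatedly apply the transition matrix. A minimal Python sketch (the two-state matrix and the iteration count are illustrative; irreducibility and aperiodicity are what guarantee convergence):

```python
def stationary(M, iters=500):
    """Approximate the stationary distribution Pi of a k-state chain.

    M[i][j] = Pr[Z_2 = j | Z_1 = i]. Power iteration from the uniform
    distribution; for an irreducible, aperiodic chain this converges to Pi.
    """
    k = len(M)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * M[i][j] for i in range(k)) for j in range(k)]
    return pi
```

For example, the chain $M = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}$ has stationary distribution $\Pi = (5/6, 1/6)$, which one can verify satisfies $\Pi M = \Pi$.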
\begin{defn}[Entropy of Markov chain]
We can simplify the definition of entropy using properties of the Markov chains we're considering:
\begin{align*}
H(M)
&= \lim_{n \to \infty} H(Z_n | Z_1, \cdots, Z_{n-1}) \\
&= \lim_{n \to \infty} H(Z_n | Z_{n-1}) \\
&= H(Z_2 | Z_1)
\end{align*}
\end{defn}
\begin{exercise}
Given the above, find the entropy of a $k$-state time-invariant Markov chain given the transition matrix $M$ and the stationary distribution $\Pi$.
\end{exercise}
Furthermore, we can hide the Markov chain in the background:
\begin{defn}[Hidden Markov Model]
A \textit{Hidden Markov Model} (HMM) has an underlying Markov chain $Z_1, \cdots, Z_n$. Given a distribution $P_\sigma$ for each of the possible states $\sigma \in \Gamma$, this induces a second sequence $X_1, \cdots, X_n$ drawn from the first, where $X_i \sim P_{Z_i}$. This sequence $\{X_i\}$ is the observed output of the model.
\end{defn}
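To make the two-layer structure concrete, here is a small Python sketch of sampling from an HMM: run the hidden chain, and at each step emit a symbol from the current state's output distribution. All the particular matrices and symbols below are illustrative, not from the lecture:

```python
import random

def sample_hmm(M, emit, pi0, n, seed=0):
    """Sample n observed symbols X_1, ..., X_n from a hidden Markov model.

    M: transition matrix of the hidden chain, M[i][j] = Pr[next=j | cur=i]
    emit[state]: the output distribution P_sigma, as {symbol: probability}
    pi0: initial distribution of Z_1
    """
    rng = random.Random(seed)
    k = len(M)
    z = rng.choices(range(k), weights=pi0)[0]  # sample the hidden start state
    out = []
    for _ in range(n):
        syms = list(emit[z])
        out.append(rng.choices(syms, weights=[emit[z][s] for s in syms])[0])
        z = rng.choices(range(k), weights=M[z])[0]  # hidden transition
    return out
```

With deterministic transitions and emissions (a two-state chain that alternates, each state always emitting its own symbol), the observed sequence alternates as well; with noisy emissions, the observer sees only a garbled shadow of the hidden walk.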
It turns out that Lempel-Ziv can compress HMMs, and this is one of the nicest classes Lempel-Ziv can compress. We will see this later.
\end{document}