\documentclass[10pt]{article}
\usepackage{amsfonts,amsthm,amsmath,amssymb}
\usepackage{array}
\usepackage{epsfig}
\usepackage{fullpage}
\usepackage[colorlinks = false]{hyperref}
\newcommand{\1}{\mathbbm{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\eps}{\varepsilon}
\newcommand{\bern}{\text{Bern}}
\newcommand{\DTIME}{\textbf{DTIME}}
\renewcommand{\P}{\textbf{P}}
\newcommand{\SPACE}{\textbf{SPACE}}
\usepackage{latexsym}
\usepackage{booktabs}
\usepackage{soul,color}
\theoremstyle{definition}
\begin{document}
\input{preamble.tex}
\newtheorem{example}[theorem]{Example}
\theoremstyle{definition}
\newtheorem{defn}[theorem]{Definition}
%\newtheorem{remark}{Remark}[section]
\handout{CS 229r Information Theory in Computer Science}{Jan 29, 2019}{Instructor:
Madhu Sudan}{Scribe: Mirac Suzgun}{Lecture 8: A Gentle Introduction to Polar Codes}
\section{Bookkeeping}
\subsection{Outline for Today}
\begin{enumerate}
\item Overview of Polar Codes.
\item Principal Claims.
\item Encoding + Decoding.
\end{enumerate}
\subsection{Administrative Issues}
\begin{enumerate}
\item Professor Sudan will not be holding his office hours today.
\item Mitali has her usual office hours at 5:00 pm this evening.
\item Problem Set 2 is due Tuesday, February 26th.
\end{enumerate}
\section{Review: Linear Codes}
Last time, we began setting the stage for polar codes. We want to perform efficient error correction for the Binary Symmetric Channel with parameter $p$, $\text{BSC}(p)$. We know its capacity, $1 - h(p)$, and would like to get $\epsilon$-close to capacity using efficient coding algorithms.
We talked about the \textit{divide-and-encode} technique: take a large block, split it into smaller chunks, and encode each small chunk separately. Working with small chunks helps us manage the running time, which may be exponential in the size of the small blocks. However, no matter what we do, the size of our small chunks will be $\mathcal{O}(1/\epsilon^2)$, or some polynomial in $1/\epsilon$; therefore the running time is exponential in this parameter (i.e., $\mathcal{O}(2^{1/\varepsilon^2})$), which does not get us close to capacity with feasible algorithms.
We are interested in codes with efficient algorithms that take small chunks of information -- as small as you want, but presumably of length at least $1/\varepsilon^2$ -- and compress these small chunks. From now on, our target theorem is the following:
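As a small numeric companion to the statements below, here is a sketch of the binary entropy function and the resulting BSC capacity (the helper name \texttt{h} matches the notation of the notes; everything else is my own choice):

```python
import math

# The binary entropy function h(p); the capacity of BSC(p) is 1 - h(p).
def h(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.11
print(h(p))      # about 0.4999: at p ~ 0.11, half of each bit is "noise"
print(1 - h(p))  # the capacity of BSC(0.11)
```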
\begin{theorem}
\label{thm:zero}
$\forall p \in [0, 1]$, $\exists$ polynomials $A, B$ such that $\forall \epsilon > 0$, $\exists$ a code of length $n \leq A (1/\epsilon)$ that gets $\epsilon$-close to capacity with pre-processing, encoding, and decoding time $\leq B (1/\epsilon)$.
\end{theorem}
We want these codes to be short, and we would like to decode them efficiently.
When working with a linear compression scheme, the following theorem is equivalent to Theorem~\ref{thm:zero}.
\begin{theorem}
\label{thm:one}
Suppose $p \in (0, \frac{1}{2})$. $\exists$ polynomials $ A, B$ such that $\forall \epsilon > 0$, $\exists n \leq A (1/\epsilon)$ and $m \leq (h (p) + \epsilon) \cdot n$ with a linear compressor $H \in \mathbb{F}^{m \times n} _2$ and an efficient decompressor $D$ such that
\begin{align}
\Pr_{Z \sim \text{Bern} (p)^n} [D (HZ) \neq Z] \leq \frac{1}{n^{10}}
\end{align}
\end{theorem}
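To make the shape of the theorem concrete, here is a toy illustration of a linear compressor $Z \mapsto HZ$ over $\F_2$. The matrix below is random and comes with no efficient decompressor $D$; the whole difficulty, which polar codes address, is choosing $H$ so that a fast $D$ exists. All names and parameters here are my own illustrative choices:

```python
import random

def compress(H, z):
    # Hz over F_2: each output bit is the parity of the bits of z
    # selected by the corresponding row of H
    return [sum(h_ij & z_j for h_ij, z_j in zip(row, z)) % 2 for row in H]

random.seed(0)
n, m = 8, 5                      # in the theorem, m is roughly (h(p) + eps) * n
H = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
z = [random.randint(0, 1) for _ in range(n)]
print(compress(H, z))            # an m-bit compression of the n-bit source z
```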
\begin{remark}
The term $\frac{1}{n^{10}}$ in Theorem~\ref{thm:one} does not have a special meaning in the equation. Changing this term will only change the polynomials $A$ and $B$.
\end{remark}
\begin{remark}
Assuming that we have such a good (i.e., linear) compression algorithm, how can we construct a good coding algorithm? This was an open question until a decade ago. In 2008, we found a code that works. In 2013, we found a proof that this code works. And now, in 2019, we are actually able to teach it in the classroom.
\end{remark}
\begin{proposition}
$\forall p \in (0, \frac{1}{2})$, $\exists \delta > 0$ such that $\forall n$, $n$ bits can be compressed to length $m \leq h (p) \cdot n + \mathcal{O} (n^{1- \delta})$.
\end{proposition}
\begin{exercise}
Try to come up with a non-linear but efficient scheme that achieves $h (p) \cdot n + \mathcal{O} (n^{1/2})$.
\end{exercise}
Note that we still expect to see some loss, but it should not grow linearly in $n$.
\subsection{Polar Codes \cite{arikan2008channel}}
Let us now construct these magical codes. This idea is due to Erdal Ar{\i}kan, a Turkish information theorist. Ar{\i}kan said, let me take two bits and try to show you how to compress them efficiently. Two bits! What can we do with two bits? Remember that we can only perform a linear operation.
Suppose we have two bits $U, V$. One simple approach would be to take their (XOR) sum:
\begin{align}
(U, V) \mapsto U+V
\end{align}
But this is too ambitious: we have definitely lost one bit of information, so this alone will not work, unfortunately.
Let us try to add one more piece of information. What can we add? It must be something different from $U+V$, and it must be linear. There are only two (reasonable) options remaining, so we will pick one of them and output $V$, in addition to $U+V$, that is:
\begin{align}
\label{map:polarization}
(U, V) \mapsto (U+V, V)
\end{align}
\begin{remark}
It is important to realize that this is a completely reversible operation. If we are given the pair $(U+V, V)$, we can easily determine the values of $U$ and $V$.
\end{remark}
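The reversibility is easy to check exhaustively; a minimal sketch (function names are my own):

```python
# Arikan's two-bit step (u, v) -> (u + v, v) over F_2, and its inverse.
def step(u, v):
    return (u ^ v, v)

def unstep(s, v):
    # given (u + v, v), recover u as (u + v) + v over F_2
    return (s ^ v, v)

for u in (0, 1):
    for v in (0, 1):
        assert unstep(*step(u, v)) == (u, v)
print("the map is a bijection on all four inputs")
```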
Ar{\i}kan noticed that this process does not compress yet, but it starts to differentiate the entropies.
\begin{lemma} If $U$ and $V$ are i.i.d. random variables distributed according to $\text{Bern} (p)$, where $p \in (0, \frac{1}{2})$, then
\begin{align}
H(U+V) > \max \{ H (U), H(V) \}
\end{align}
\end{lemma}
Suppose $U, V \overset{\text{i.i.d.}}{\sim} \text{Bern} (p)$, with $p \in (0, \frac{1}{2})$. Then $U + V \sim \text{Bern} (p')$, where $p' = (2p) (1-p)$; in fact, if $0 < p < \frac{1}{2}$, then $0 < p < p' < \frac{1}{2}$. To see this, let us consider a simpler, equivalent view, where $U, V \in \{-1, +1\}$.
\begin{align}
U, V =
\begin{cases}
+1 & \text{with probability } 1 - p \\
-1 & \text{with probability } p \\
\end{cases}
\end{align}
Now, consider the product of these two random variables, $U \cdot V$; in this view, the product plays the role of the XOR. The joint distribution of the corresponding bits is given below.
\begin{table*}[h]
\centering
\begin{tabular}{c | c | c }
$U \backslash V$ & $\mathbf{0}$ & $\mathbf{1}$ \\ \hline
$\mathbf{0}$ & $(1-p)^2$ & $(1-p)p$ \\
$\mathbf{1}$ & $p (1-p)$ & $p^2$
\end{tabular}
\end{table*}
\begin{exercise}
Analyzing $\E [U]$ and $\E [UV]$, show that $\frac{1}{2} > p' > p$.
\end{exercise}
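A quick numeric check of the two views of $p'$ (the concrete value $p = 0.2$ is my own choice): directly, $p' = 2p(1-p)$; in the $\{-1,+1\}$ view, $\E[UV] = \E[U]\,\E[V] = (1-2p)^2$, and the bias satisfies $1 - 2p' = (1-2p)^2$.

```python
p = 0.2
p_prime = 2 * p * (1 - p)        # Pr[U + V = 1] over F_2
bias = (1 - 2 * p) ** 2          # E[UV] in the {-1, +1} view
assert abs((1 - bias) / 2 - p_prime) < 1e-12   # both views agree
assert p < p_prime < 0.5         # here 0.2 < 0.32 < 0.5
print(p_prime)
```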
Since $U$ and $V$ are independent, $H(U, V) = H (U) + H(V)$; moreover, the map $(U, V) \mapsto (U+V, V)$ is invertible, so $H (U + V, V) = H (U, V)$. By the chain rule, we can write $H (U + V, V)$ as follows:
\begin{align}
H (U + V, V) = H (U + V) + H (V \mid U + V)
\end{align}
\begin{figure}[h]
\centering
{\includegraphics[width=0.40\textwidth]{LocalPolarization.png}}
\caption{Local Polarization}
\label{fig:polarization}
\end{figure}
If $U, V \overset{\text{i.i.d.}}{\sim} \text{Bern} (p)$, then $H(U) = H(V) = h(p)$. On the other hand, $U + V \sim \bern (p')$, where $p' > p$, so $H (U + V) = h(p') > h(p)$, and $H (V \mid U +V) = H (U + V, V) - H (U + V) = 2 h (p) - h (p') < h (p)$.
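The two entropies computed above can be checked numerically; the value $p = 0.2$ and the helper \texttt{h} are my own illustrative choices:

```python
import math

# Local polarization at p = 0.2: the sum U + V has entropy h(p') > h(p),
# the conditioned bit has entropy 2 h(p) - h(p') < h(p), and the total
# entropy 2 h(p) is conserved.
def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.2
p_prime = 2 * p * (1 - p)
high = h(p_prime)               # H(U + V)
low = 2 * h(p) - h(p_prime)     # H(V | U + V)
assert low < h(p) < high        # the entropies move apart
print(high, low)
```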
\subsection{The Main Idea}
We would like to squeeze our conditional entropies as close as possible to the extremes: we want every bit of the message to have conditional entropy close to $1$ or $0$. As we will see shortly, the bits with conditional entropy close to $0$ will no longer be necessary, and at the end, we will send only the bits whose conditional entropies are close to $1$.
% \section{Repeating}
Let us start with $n$ independent $\bern (p)$ bits. This is the message that we would like to compress. For the time being, let us assume that $n$ is a power of $2$, that is $n = 2^{t}$, for some $t \in \mathbb{N}$, however the algorithm works just fine even without this assumption.
\begin{figure}[h!]
\centering
{\includegraphics[width=0.70\textwidth]{Polarization.png}}
\caption{Polarization Process}
\label{fig:polarization-process}
\end{figure}
Let us pair these $n$ bits arbitrarily, obtaining $n/2$ ordered pairs. We map each ordered pair of the form $(U, V)$ to $(U+V, V)$, and then group all the elements of type $U+V$ together and all the elements of type $V$ together, while respecting their order. Now, all the elements in the first group are i.i.d. Bernoulli random variables with parameter $p'$, where $0 < p < p' < \frac{1}{2}$; therefore, $H(U+V) = h (p') > h(p)$. Similarly, all the elements in the second group satisfy $H(V \mid U + V) < H (V) = h(p)$.
It should be clear to our astute readers that this process increases the conditional entropy of one group while decreasing that of the other. We therefore repeat this process recursively within each group until we are left with single bits $W_1, W_2, \cdots, W_n$. Under this scheme, the conditional entropy of the first singleton $W_1$ is very close to $1$, whereas that of the last singleton $W_n$ is very close to $0$. We would now like to make sure that the singletons in the middle, whose conditional entropies are bounded away from both $0$ and $1$, are as few as possible.
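The recursive process above can be sketched as follows. The pairing and ordering below are one consistent choice (the notes leave the pairing arbitrary); the "high entropy" group is kept first, as in the text:

```python
# The full polarization map on n = 2^t bits, and its inverse.
def polar_transform(bits):
    n = len(bits)
    if n == 1:
        return list(bits)
    sums = [bits[i] ^ bits[i + 1] for i in range(0, n, 2)]   # the U + V group
    kept = [bits[i + 1] for i in range(0, n, 2)]             # the V group
    return polar_transform(sums) + polar_transform(kept)

def polar_inverse(w):
    n = len(w)
    if n == 1:
        return list(w)
    sums = polar_inverse(w[: n // 2])
    kept = polar_inverse(w[n // 2:])
    out = []
    for s, v in zip(sums, kept):
        out += [s ^ v, v]       # recover u = (u + v) + v, then v
    return out

z = [1, 1, 0, 1, 0, 0, 1, 0]
assert polar_inverse(polar_transform(z)) == z   # the whole map is invertible
```

Since each two-bit step is reversible, the composed map on all $n$ bits is reversible as well, which the round-trip assertion confirms.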
\begin{claim}
Suppose $\forall j$, $H(W_j \mid W_{<j}) \notin (\tau, 1 - \tau)$, where $W_{<j}$ denotes $(W_1, \ldots, W_{j-1})$. Then,
\begin{align}
|S| &= |\{j \mid H (W_j \mid W_{