\documentclass[10pt]{article}
\usepackage{amsfonts,amsthm,amsmath,amssymb}
\usepackage{array}
\usepackage{epsfig}
\usepackage{fullpage}
\usepackage[colorlinks = false]{hyperref}
\usepackage{bbm}
\newcommand{\1}{\mathbbm{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\eps}{\varepsilon}
\newcommand{\DTIME}{\textbf{DTIME}}
\renewcommand{\P}{\textbf{P}}
\newcommand{\SPACE}{\textbf{SPACE}}
\usepackage{tikz}
\usepackage[europeanresistors,americaninductors]{circuitikz}
\usetikzlibrary{chains}
\usetikzlibrary{decorations.pathreplacing,decorations.pathmorphing}
\begin{document}
\input{preamble.tex}
\newtheorem{example}[theorem]{Example}
\theoremstyle{definition}
\newtheorem{defn}[theorem]{Definition}
\handout{CS 229r Information Theory in Computer Science}{Feb 21, 2019}{Instructor:
Madhu Sudan}{Scribe: Dan Stefan Eniceicu}{Lecture 8}
\section{Overview}
\subsection{Outline}
\begin{enumerate}
\item Polar Coding
\item Overview
\item Principal Claims
\item Encoding, etc.
\end{enumerate}
\subsection{Administrative Things}
\begin{enumerate}
\item No office hours today
\item Mitali has usual office hours
\item Pset 2 due Tuesday
\item When you ask a question, please state your name so that Madhu gets to know you
\end{enumerate}
\section{Review}
Our goal is to perform efficient correction of errors for the binary symmetric channel BSC($p$). We know its capacity; we want to get $\epsilon$-close to capacity with efficient algorithms, and one thing we have seen is that we can take a large block, split it into smaller chunks, and work with each chunk separately. Working with small blocks helps with running time, since the running time may be exponential in the length of the small blocks. But no matter what, this length will be $\sim O(1/\epsilon^2)$, or some polynomial in $1/\epsilon$, which when exponentiated gives running time $\sim O(2^{1/\epsilon^2})$. So we are only interested in blocks of small length (around $1/\epsilon^2$). The target theorem from now on is:
\begin{theorem}
$\forall p\in[0,1]$, there exist polynomials $A$ and $B$ such that $\forall \epsilon>0$ there exists a code of length $n\leq A(1/\epsilon)$ that gets $\epsilon$ close to capacity, with preprocessing, encoding, decoding time all $\leq B(1/\epsilon)$.
\end{theorem}
Reminder: These codes need to be short and we want to decode them very efficiently. This theorem is our target. Last time we said that even though we were talking about error correction, we should actually be talking about compression with a linear mechanism, so we are interested in a linear compression scheme. The following is a theorem which is equivalent to the other one when working with a linear compression scheme:\\
\begin{theorem}
$\forall p\in(0,1/2)$, there exist polynomials $A$, $B$ such that $\forall\epsilon>0$, there exist $n\leq A(1/\epsilon)$, $m\leq(h(p)+\epsilon)n$, a matrix $H\in \F_2^{m\times n}$, and a decompressor $D$ such that
$$\mbox{Pr}_{Z\sim\mbox{Bern}(p)^n}[D(HZ)\not=Z]\leq1/n^{10}.$$
\end{theorem}
Notes:
\begin{enumerate}
\item
The exponent $10$ in the term $1/n^{10}$ is just an arbitrary constant. Changing this constant will likely change $A$ and $B$.
\item
Linearity over $\F_2$ means each output bit is the xor of a subset of the input bits.
\item
Theorem 2 implies Theorem 1, so we will prove Theorem 2.
\item
In 2008 (spoiler: \url{https://arxiv.org/abs/0807.3917}), Arikan found a code satisfying this, and in 2013 it was proved that said code works. Now we can teach it. :)
\end{enumerate}
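At toy scale, the objects in Theorem 2 can be illustrated directly. The following Python sketch (not from the lecture; the parity-check matrix $H$ here is just random, and the decompressor is brute force, so it only runs for tiny $n$) compresses $Z$ to the $m$-bit string $HZ$ and decompresses by finding a minimum-weight preimage, which is the maximum-likelihood guess under Bern$(p)^n$ when $p<1/2$:

```python
import itertools
import random

def compress(H, z):
    """Linear compression over F_2: output the m-bit string Hz (mod 2)."""
    return tuple(sum(hi & zi for hi, zi in zip(row, z)) % 2 for row in H)

def decompress(H, y):
    """Brute-force decompressor: return a minimum-weight z with Hz = y.
    For p < 1/2 this is the maximum-likelihood guess under Bern(p)^n.
    Exponential in n, so only usable at toy scale."""
    n = len(H[0])
    best = None
    for z in itertools.product((0, 1), repeat=n):
        if compress(H, z) == y and (best is None or sum(z) < sum(best)):
            best = z
    return best

random.seed(0)
n, m, p = 10, 7, 0.1  # m plays the role of (h(p) + eps) n in the theorem
H = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
Z = tuple(1 if random.random() < p else 0 for _ in range(n))
print("recovered Z exactly:", decompress(H, compress(H, Z)) == Z)
```

A random $H$ gives no guarantee of recovery; the point of the theorem is that a carefully constructed $H$ makes the failure probability at most $1/n^{10}$ while keeping decompression polynomial time.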
We can attempt to work with an alternative claim:
\begin{claim}
$\forall p\in(0,1/2)$ there exists $\delta >0$ such that $m\leq h(p)n+O(n^{1-\delta})$.
\end{claim}
We basically want the excess over $h(p)n$ to grow sublinearly in $n$.
\begin{exercise}
Try to come up with a nonlinear (efficient) scheme that achieves, for example, $h(p)n+O(n^{0.51})$.
\end{exercise}
This shows us we should expect to see some loss (which can be sublinear in $n$). This is our target.
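For intuition on why sublinear loss is achievable nonlinearly, here is one standard approach (a Python sketch, not the only solution to the exercise): enumerative coding, which describes a string by its Hamming weight $k$ together with its rank among all weight-$k$ strings, for a total of about $\log_2(n+1)+\log_2\binom{n}{k}\approx h(k/n)n+O(\log n)$ bits.

```python
from math import ceil, comb, log2

def compress(z):
    """Enumerative compression: describe a bit-string by its Hamming
    weight k and its rank (combinatorial number system) among all
    length-n strings of weight k."""
    n, k = len(z), sum(z)
    rank, ones = 0, 0
    for i, b in enumerate(z):
        if b:
            ones += 1
            rank += comb(i, ones)  # j-th one at position i contributes C(i, j)
    return n, k, rank

def decompress(n, k, rank):
    """Invert compress by greedily peeling off the largest binomial."""
    z = [0] * n
    for j in range(k, 0, -1):
        c = j - 1
        while comb(c + 1, j) <= rank:  # largest c with C(c, j) <= rank
            c += 1
        z[c] = 1
        rank -= comb(c, j)
    return tuple(z)

def length_in_bits(n, k):
    """~ log2(n+1) bits for k, plus log2(C(n,k)) ~ h(k/n) n bits for rank."""
    bits_for_rank = ceil(log2(comb(n, k))) if comb(n, k) > 1 else 0
    return ceil(log2(n + 1)) + bits_for_rank

z = (0, 1, 1, 0, 0, 0, 1, 0)
assert decompress(*compress(z)) == z
print(f"length ~ {length_in_bits(len(z), sum(z))} bits for n = {len(z)}")
```

The scheme is nonlinear (the weight is not an xor of input bits), and by concentration the typical weight of a Bern$(p)^n$ string gives length $h(p)n + O(\sqrt{n}\log n)$ or so, comfortably within the $h(p)n+O(n^{0.51})$ budget.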
\section{Polar Codes}
\subsection{Compressing 2 bits}
Arikan's remarkable idea was to look at 2 bits and attempt to compress them linearly. So we have two bits, $u$ and $v$, which we can try to compress. What can we do with them? One idea is to output their xor, $u+v$, but this loses information: compressing 2 bits to 1 is too ambitious, so we need one more bit. Let us then map $(u,v)\rightarrow(u+v,v)$. This map is invertible, so by itself it compresses nothing. What Arikan noticed, however, was that $$H(u,v)=H(u+v,v)$$
due to invertibility, but at the same time, we have
$$H(u+v)>H(u),H(v).$$
To prove this last claim, let $u,v\sim\mbox{Bern}(p)$ be i.i.d. Then $u+v\sim\mbox{Bern}(p')$ with $p'=\Pr[u+v=1]=p(1-p)+(1-p)p=2p(1-p)$.
\begin{exercise}
Show that for $0<p<1/2$ we have $1/2>p'>p$.
\end{exercise}
Now, we have
$$H(u,v)=H(u+v,v)\mbox{ and }H(u+v)>H(u),H(v).$$
By the chain rule,
$$H(u,v)=H(u)+H(v),$$
since $u$ and $v$ are independent, and
$$H(u+v,v)=H(u+v)+H(v|u+v).$$
Thus,
$$H(v|u+v)<H(v)=h(p),$$
while $H(u+v)>h(p)$: the transform preserves the total entropy but polarizes it between the two output bits. Now take many i.i.d.\ pairs $u,v\sim\mbox{Bern}(p)$ and apply the map $(u,v)\rightarrow(u+v,v)$ to each pair, collecting the xor bits $u+v$ into a top group and the bits $v$ into a bottom group. The elements of the top group satisfy $H(u+v)>h(p)$, and the elements of the bottom group will satisfy
$$H(v|u+v) < H(v)=h(p).$$
Since the elements of the top group, apart from $u+v$, are independent of $v$, we can instead condition on all bits in the top group, leading to $H(v|\mbox{top group bits}) < h(p)$.
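The polarization can also be checked numerically. The sketch below (with the illustrative choice $p=0.11$; any $p\in(0,1/2)$ behaves similarly) computes $H(u+v)=h(2p(1-p))$ directly and $H(v|u+v)=2h(p)-H(u+v)$ via the chain rule:

```python
from math import log2

def h(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

p = 0.11
p_sum = 2 * p * (1 - p)          # u + v ~ Bern(p') with p' = 2p(1-p)
H_top = h(p_sum)                 # H(u+v), the "top group" entropy
H_bottom = 2 * h(p) - H_top      # H(v | u+v) by the chain rule
print(f"h(p)       = {h(p):.4f}")
print(f"H(u+v)     = {H_top:.4f}  (> h(p))")
print(f"H(v | u+v) = {H_bottom:.4f}  (< h(p))")
assert H_top > h(p) > H_bottom
assert abs(H_top + H_bottom - 2 * h(p)) < 1e-12
```

For $p=0.11$ we get $h(p)\approx 0.50$, while the two output entropies split to about $0.71$ and $0.29$: same total, pushed apart.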