\documentclass[10pt]{article}
\usepackage{amsfonts,amsthm,amsmath,amssymb}
\usepackage{array}
\usepackage{parskip}
\usepackage{epsfig}
\usepackage{fullpage}
\usepackage{graphicx} %package to manage images
\graphicspath{ {images/} }
\begin{document}
\input{preamble.tex}
\renewcommand{\binset}{\bbF_2}
\handout{CS 229r Essential Coding Theory, Lecture 15}{March 21, 2017}{Instructor: Madhu Sudan}{Scribes: Christina Ilvento}{Lecture 15: Polar Codes}
%Hamming Codes, Distance, Examples, Limits, and Algorithms}
\subsection*{Administrivia}
Office hours for the rest of the semester
\begin{itemize}
\item Tuesdays: 2:30-3:30
\item Thursdays: 5-6 (but cancelled this week)
\end{itemize}
\section*{Today: Polar Codes}
Until now, we've been covering primarily standard material in coding theory, and for the rest of the class we're going to cover more recent stuff (within the last 10 years).
Today: Polar Codes
\begin{itemize}
\item Motivation
\item Idea
\item Some initial proofs
\end{itemize}
Next time:
\begin{itemize}
\item Complete proofs
\end{itemize}
\subsection*{Motivation}
We want to deal well with random errors: efficient encoding and decoding that achieve capacity. Polar Codes, we'll see, are essentially the ultimate answer to how well we can deal with random errors.
\subsubsection*{The Challenge} Consider the binary symmetric channel with crossover probability $p$, $BSC(p)$,
\[BSC_p(x) = \begin{cases}
x &\text{with } \Pr=1-p\\
\bar{x} &\text{with }\Pr=p
\end{cases}\]
Given an input bit $x \in \{0,1\}$, the output bit is $y=x$ with probability $1-p$ and $y=\bar{x}$ with probability $p$. We only deal with $p<0.5$; at $p=0.5$ the output is independent of the input, so we learn nothing.
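As a quick sanity check on the model, here is a minimal Monte Carlo sketch (the helper \texttt{bsc} is our own, not from any library), using the convention that $BSC(p)$ flips each bit independently with probability $p$:

```python
import random

def bsc(bits, p, rng):
    """Pass a bit sequence through BSC(p): flip each bit independently w.p. p."""
    return [b ^ (rng.random() < p) for b in bits]

rng = random.Random(0)
n, p = 100_000, 0.1
y = bsc([0] * n, p, rng)          # send the all-zeros word
flip_rate = sum(y) / n            # fraction of bits the channel flipped
print(flip_rate)
```

The empirical flip rate concentrates around $p$, which is exactly the regime ($p < 0.5$) where decoding is information-theoretically possible.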
We know from Shannon that any rate $R < 1-H(p)$ is achievable. The catch is that if we look at \textit{constructive} results and want to achieve rate $R= 1 - H(p) - \epsilon$, all of the constructions we've seen so far require decoding time $\geq 2^{1/\epsilon^2}$. The challenge: can we do better? Yes, with Polar Codes.\footnote{There may be other codes, but Polar Codes seem to be the cleanest.}
\subsubsection*{Codes Thus Far}
Let's revisit the codes we've seen so far in the course:
\begin{enumerate}
\item Existential Results: less interesting, because they don't have efficient algorithms
\item Algebraic codes: algorithms are poly-time, but they don't work great over small alphabets
\item Graph-theoretic codes: also good for worst case errors, have linear time algorithms, but haven't yet gotten us to capacity
\item Information Theoretic codes (today): depart a bit from the above and rely more heavily on information-theoretic concepts like conditional entropy
\end{enumerate}
\subsection*{Polar Codes: Basic Construction Ideas}
The first step in understanding the construction of polar codes is to see that a good compression algorithm yields a good coding algorithm: a linear compression map whose decompressor is efficient gives efficient decoding.
\subsubsection*{Compression to Decoding}
First, recall the definition of a Bernoulli random variable:
\[x \sim Bern(p) \triangleq \begin{cases}1\text{ with } \Pr= p\\
0 \text{ with }\Pr =1-p\end{cases}\]
Take $x$ to be an $n$-bit Bernoulli random vector,
$x \sim Bern(p)^n$, along with a compression function $f: \{0,1\}^n \rightarrow \{0,1\}^m$ for $m < n$ and a decompressor $f^{-1}: \{0,1\}^m\rightarrow\{0,1\}^n$, where $m\approx H(p)\cdot n$.
\textbf{Claim: }
If $f$ is linear and $\Pr_{x\sim Bern(p)^n}[f^{-1}(f(x))\neq x]\rightarrow 0$, then we have an efficient encoding and decoding algorithm. Why?
\begin{itemize}
\item $f$ is linear implies $f(x) = xH$ for some matrix $H$. We'll take $H$ will be the parity check matrix of the code we're building
\item Take the words $c$ such that $cH=0$ to be our codewords. The receiver gets $z = c + x$ where $x \sim Bern(p)^n$, and $f^{-1}((c+x)H) = f^{-1}(xH) = f^{-1}(f(x))$, which equals $x$ with high probability by our assumption. So by applying $f^{-1}$ we compute the error vector, which we simply subtract from the received word $z$ to recover $c$.
\item So the decoding algorithm based on this is to just compute $z - f^{-1}(f(z))$.
\end{itemize}
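To make the compression-to-decoding reduction concrete, here is a toy sketch (all names, \texttt{mat\_vec\_f2}, \texttt{finv}, \texttt{decode}, are ours) that uses the parity-check matrix of the $[7,4]$ Hamming code as the linear compressor $f$, and a brute-force lowest-weight-preimage table as $f^{-1}$; decoding a received word $z$ is then just $z - f^{-1}(f(z))$ as described above:

```python
import itertools

def mat_vec_f2(x, H):
    """Compute x * H over GF(2); x is a bit list, H a list of rows."""
    m = len(H[0])
    return [sum(x[i] * H[i][j] for i in range(len(x))) % 2 for j in range(m)]

# Parity-check matrix of the [7,4] Hamming code: row i is the binary
# representation of i+1, so f(x) = x*H maps 7 bits to a 3-bit syndrome.
H = [[(i >> k) & 1 for k in range(3)] for i in range(1, 8)]

# Decompressor f^{-1}: map each 3-bit syndrome to its lowest-weight preimage,
# built by brute force -- fine for a toy example.
finv = {}
for x in itertools.product([0, 1], repeat=7):
    s = tuple(mat_vec_f2(list(x), H))
    if s not in finv or sum(x) < sum(finv[s]):
        finv[s] = list(x)

def decode(z):
    """Syndrome decoding: estimate the error as f^{-1}(f(z)), subtract it."""
    e = finv[tuple(mat_vec_f2(z, H))]
    return [zi ^ ei for zi, ei in zip(z, e)]

c = [1, 1, 1, 0, 0, 0, 0]           # c*H = 0, so c is a codeword
assert mat_vec_f2(c, H) == [0, 0, 0]
z = c[:]; z[4] ^= 1                 # the channel flips one bit
print(decode(z))                    # recovers c
```

The brute-force table here plays the role that the carefully constructed $P$ plays for polar codes: it makes $f^{-1}$ cheap to evaluate.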
%So now we are just thinking of compressing and decompressing, but the key thing is that $f$ is linear.
As a side note, we could also pick $H$ at random, but then we wouldn't know how to compute $f^{-1}$ quickly; the key assumption in the above was that $f$ is linear and $f^{-1}$ is efficiently computable.
\subsubsection*{Constructing H}
The new idea in Arikan's work is to build $H$ very carefully so that it meets these requirements.
To construct $H$ we'll construct a larger matrix $P$ which will act to concentrate entropy in certain parts of our output.
\[x \cdot P = \begin{bmatrix} (xP)_L &\big|& (xP)_R \end{bmatrix}\]
where $x$ is our $n$-bit input (a row vector), $P$ is a square $n\times n$ matrix, and $(xP)_L$ is the left-hand $m$ bits of the result.
We'll pick $P$ to be invertible, and we'll insist that $H((xP)_R|(xP)_L)$, the entropy of the right-hand side of the result conditioned on the left-hand side, is $o(1)$. The amount of uncertainty about the right-hand side, once you know the left-hand side, is very very small, so $(xP)_R$ will usually be determined by $(xP)_L$. If we can construct such a $P$, then we just take the left $m$ columns of $P$ to be $H$.
\subsection*{Information Theory Background}
Before we get to the proofs, we'll need to cover some background on information theory.
\textbf{The entropy of a single variable}: Take $z$ to be a random variable with distribution $(p_1,...,p_n)$, where $p_i \triangleq \Pr[z=i]$, and define $H(z) \triangleq\sum_i p_i\log_2(\frac{1}{p_i})$. If the entropy of $z$ goes to 0, then $z$ is almost completely determined. If $H(z)=k$, then $z$ has roughly $2^k$ typical values.
\textbf{Joint Distributions}
We'll often be interested in joint distributions, that is the pair of variables
$(x,y)$ with $\Pr[x=i,y=j]=p_{ij}$, which can be expressed in the same way as the above.
\textbf{Conditional entropy}
The entropy of $X$ conditioned on $Y$ is written as:
\[H(X|Y) = \sum_j \Pr[Y=j]H(X|Y=j)\]
There are several key rules and expressions we'll come back to:
\begin{itemize}
\item \textbf{Chain rule:} $H(X,Y) = H(Y) + H(X|Y)$
\item Conditional entropy is always less than or equal to unconditional entropy in expectation: $H(X|Y) \leq H(X)$. For particular outcomes of $Y$ it may increase, but in expectation it does not. To see this, consider the joint distribution where $Y$ is rarely $1$ and \[X=\begin{cases}1 &\text{if }Y=0\\ \text{uniform}(n) &\text{if }Y=1 \end{cases}\] Here $H(X|Y=1)=\log_2 n$ can far exceed $H(X)$, yet the average $H(X|Y)$ is still at most $H(X)$.
\item The "entropy triangle inequality"
$H(X,Y) \leq H(X) + H(Y)$
\item If $H(X) = \alpha$ for some $\alpha < 1$, then $\Pr[X=mode(X)] \geq 1-\alpha$. This follows from the fact that $H(p) \geq p$ for $p \leq 1/2$.
\item $H(X)=n \Rightarrow |supp(X)|\geq 2^n$; the only way to get high entropy is to have large support.
\item If we just relabel the domain, the uncertainty/entropy doesn't change. Consider $x \sim \Delta([n])$, ($x$ drawn from a distribution on $1,...,n$). If we take a permutation $\pi$ and apply it to $x$ to generate $y$, we clearly have that $H(x) = H(y)$. \textbf{Simple Exercise:} Verify this is the case for any bijection.
\end{itemize}
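These entropy rules are easy to verify numerically. The following sketch (the joint distribution is an arbitrary example of ours) checks the chain rule, monotonicity under conditioning, and the triangle inequality on a small joint distribution:

```python
from math import log2

def H(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# A small joint distribution on (X, Y); the numbers are arbitrary.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
PX = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# H(X|Y) = sum_j Pr[Y=j] * H(X | Y=j), computed from the definition.
HX_given_Y = sum(
    PY[y] * H({x: joint[(x, y)] / PY[y] for x in (0, 1)}) for y in (0, 1)
)

assert abs(H(joint) - (H(PY) + HX_given_Y)) < 1e-12  # chain rule
assert HX_given_Y <= H(PX)          # conditioning cannot increase entropy
assert H(joint) <= H(PX) + H(PY)    # "entropy triangle inequality"
```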
Entropy is a way of managing uncertainty, and we are interested in uncertainty because we want to start with a high entropy value $x$, do some linear operations (apply $P$), shove the uncertainty towards the left and get the right hand side to be completely determined given the left hand side.
So our whole goal is to move the uncertainty from the right-hand side of the output to the left.
\subsection*{Constructing P}
Before we get too far, recall that $P$ is invertible, which means that the entropy of $xP$ is the same as the entropy of $x$. We also have $H(x)=nH(p)$ when $x\sim Bern(p)^n$. So the left-hand side of our result must carry about $H(p)n$ bits of entropy, giving $m \approx H(p)\cdot n$. Now we'll show how to construct $P$ to meet these requirements.
We'll use a simple iterative process to build up $P$, by building smaller sub-matrices first.
Each sub-matrix will polarize the entropy of its outputs. To see how this works, start by looking at $x_1,x_2$, two independent Bernoulli bits, and convert them to a new two-bit sequence in which the entropy is non-uniform. That is, we want:
\begin{itemize}
\item $H(x_1,x_2) = H(y_1,y_2)$
\item $H(y_2|y_1) < H(x_2)$ which implies
\item $H(y_1) > H(x_1)$
\end{itemize}
How to do this? Take $y_1 = x_1 \oplus x_2$ and $y_2 = x_2$. So given $y_1 = x_1\oplus x_2$, the bit $y_2 = x_2$ has lower conditional entropy.\footnote{We'll see the full proof of this next time, but recall that our Bernoulli random variables have $p < \frac{1}{2}$ and work out a few small examples to build intuition.}
For example, let's look at the base-case matrix $P_2$ on two bits:
\[P_2 = \begin{bmatrix}1 &0\\ 1 & 1\end{bmatrix}\]
\[\begin{bmatrix}x_1 & x_2\end{bmatrix}\begin{bmatrix}1 &0\\ 1 & 1\end{bmatrix} = \begin{bmatrix}x_1\oplus x_2 & x_2\end{bmatrix} = \begin{bmatrix}y_1 & y_2\end{bmatrix}\]
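We can verify this 2-bit polarization step exactly. The sketch below (with $x_1,x_2$ i.i.d.\ $Bern(p)$, taking $Bern(p)$ to be $1$ with probability $p$) computes the entropies in closed form via the chain rule:

```python
from math import log2

def h(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

p = 0.11                          # any p < 1/2 works
q = 2 * p * (1 - p)               # Pr[x1 XOR x2 = 1]: closer to 1/2 than p

H_y1 = h(q)                       # y1 = x1 XOR x2 is Bern(2p(1-p))
H_y2_given_y1 = 2 * h(p) - H_y1   # chain rule: H(y1) + H(y2|y1) = H(x1,x2) = 2h(p)

assert H_y1 > h(p)                # entropy pushed into y1 ...
assert H_y2_given_y1 < h(p)       # ... and pulled out of y2 given y1
```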
We'll then apply this process recursively, building up larger and larger transformations until we reach our final matrix $P_{2^\ell}$.
%\[P_i = \begin{bmatrix}P_{i-1} &P_{i-1}\\ P_{i-1} & 0}\end{bmatrix}\]
At each step, $x \cdot P_n = (u \cdot P_{n/2},\; v \cdot P_{n/2})$, where $u = (x_1 \oplus x_{n/2+1}, \ldots, x_{n/2} \oplus x_n)$ and $v = (x_{n/2+1}, \ldots, x_n)$.
This gives us the following pictorial representation:\footnote{Note, in class, we showed an interleaved version of this picture, which is technically correct, but a bit confusing to reason about. The benefit of the non-interleaved version is that each bit is conditioned on all of the bits above it in the diagram.}
\includegraphics[width=0.5\textwidth]{CT1}
At any given $\oplus$ node, we have the following "Z" relation:
\includegraphics[width=0.5\textwidth]{CT2}
This lets us reason about the conditional probability of each output in relation to only the values above it. Since $D$ and $B$ are independent, $H(A|B) = H(A|B,D)$, and likewise $H(C|D) = H(C|D,B)$, so we can relate the sum of the entropies of the left nodes of the $Z$ to the sum of the (conditional) entropies of the right nodes. We do rely on the fact that at every node we are polarizing two entropically identical bits. We'll see a more complete proof of the entropy polarization in the next lecture, so for now we just state informally that
%So now we have a way to get from $2^l$ to $2^l$ bits which has more entropy in the top than the bottom. But we want *almost all* of the entropy in the left and the right to still be reasonably large.
%We know our process is linear and invertible, but will it get all of the entropy into one side?
%\textbf{Informally: } Want to say that
the conditional entropies are getting polarized, i.e., driven towards 0 or 1.
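The recursive construction can be sketched directly on bit vectors. The following function (our own naming, \texttt{polar\_transform}) applies $x \mapsto (u\cdot P_{n/2},\, v\cdot P_{n/2})$ and checks that the resulting map is invertible:

```python
import itertools

def polar_transform(x):
    """Apply x -> (u * P_{n/2}, v * P_{n/2}) recursively, where
    u = x_left XOR x_right and v = x_right; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    u = [x[i] ^ x[half + i] for i in range(half)]
    v = list(x[half:])
    return polar_transform(u) + polar_transform(v)

# Base case matches y = (x1 XOR x2, x2):
assert polar_transform([1, 0]) == [1, 0]
assert polar_transform([1, 1]) == [0, 1]

# The map is linear and invertible: it is a bijection on {0,1}^8.
images = {tuple(polar_transform(list(x))) for x in itertools.product([0, 1], repeat=8)}
assert len(images) == 2 ** 8
```

Invertibility is exactly the property we needed above: it guarantees $H(xP) = H(x)$.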
\subsubsection*{Partitioning $P$ into $H$ and $C$}
Now we'll complete our construction of $H$ by reasoning about how many of the output bits have very high or very low conditional entropy.
Consider our inputs $x_1,\ldots,x_n$ mapped to $y_1,\ldots,y_n$, where $(y_1,\ldots,y_n)=xP$. Define $\eta_i \triangleq H(y_i \mid y_1,\ldots,y_{i-1})$.
\textbf{Claim:} As $N \rightarrow \infty$, where $N=2^\ell$,
$\#\{i | \eta_i \in (\frac{1}{N^2},1- \frac{1}{N^2})\}=o(N)$.
That is, as $N$ tends to infinity, all but $o(N)$ of the conditional entropies $\eta_i$ lie outside the interval $(\frac{1}{N^2}, 1-\frac{1}{N^2})$, i.e., they are polarized to one side or the other.
Now, let's define sets of columns in $P$ based on their entropy:
\[A = \{i | \eta_i \geq 1 - 1/N^2\}\]
\[B = \{i | \eta_i \leq 1/N^2\}\]
\[C = \{i | \eta_i \in (1/N^2,1 - 1/N^2)\}\]
$A$ is the set of bits whose entropy is high (they aren't well specified by the bits above them), $B$ is the set of bits whose entropy is low (they are well specified by the bits above them), and $C$ is the set that falls in the middle of the interval (neither high nor low). We claim that $|C| = o(N)$ (proof in the next lecture). Since the $\eta_i$ sum to $H(xP)=H(p)N$, we know that $|A| \leq (H(p)+o(1))N$, so $|B| \geq (1 - H(p)-o(1))N$.
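For a small illustration of the polarization claim, the following sketch computes the conditional entropies $\eta_i$ exactly for $N=8$ and $p=0.11$ by enumerating all inputs (the recursive transform is re-declared here so the sketch is self-contained); even at this tiny size the $\eta_i$ spread away from $H(p)$:

```python
import itertools
from math import log2

def polar_transform(x):
    """x * P_N via the recursion (u * P_{N/2}, v * P_{N/2})."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    u = [x[i] ^ x[half + i] for i in range(half)]
    return polar_transform(u) + polar_transform(list(x[half:]))

p, N = 0.11, 8
hp = p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))   # H(p) per input bit

# Exact joint distribution of y = x * P for x ~ Bern(p)^N (1 w.p. p).
py = {}
for x in itertools.product([0, 1], repeat=N):
    w = 1.0
    for b in x:
        w *= p if b else 1 - p
    y = tuple(polar_transform(list(x)))
    py[y] = py.get(y, 0.0) + w

def H(dist):
    return sum(q * log2(1 / q) for q in dist.values() if q > 0)

# eta_i = H(y_1..y_i) - H(y_1..y_{i-1}), via entropies of prefix marginals.
prefix = [0.0]
for i in range(1, N + 1):
    marg = {}
    for y, q in py.items():
        marg[y[:i]] = marg.get(y[:i], 0.0) + q
    prefix.append(H(marg))
eta = [prefix[i] - prefix[i - 1] for i in range(1, N + 1)]
print([round(e, 3) for e in eta])
```

Since $P$ is invertible, the $\eta_i$ sum to exactly $N\cdot H(p)$, while individual values land above and below $H(p)$; full polarization only kicks in as $N \rightarrow \infty$.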
\subsubsection*{Completing the Construction}
Now we have all the ingredients we need for the construction of $H$. First, we'll permute the columns of $P$ so that $A$ is leftmost, $C$ is next, and $B$ is last.\footnote{Note, we do not show explicitly how to identify $A$, $B$ and $C$ in sub-exponential time; see Tal and Vardy for this.} We argue that this can only increase the entropy of the bits in $A$, as they are conditioned on fewer events, and can only decrease the entropy of the bits in $B$, as they are conditioned on more events. Namely, $H(xB|xA, xC) \leq \frac{1}{N^2}|B|$. This follows from the chain rule and monotonicity under conditioning. \textbf{Exercise}: argue this formally given our assumptions.
%Mapping $x\rightarrow x(A|C)$, ($(A|C)=H$ remember), we have that $H(xB | xH) \leq \delta = O(1/N)$.
So now we want to guess $xB$ given $xH$ (that is, given $xA$ and $xC$) so that we can invert the map. To guess $xB$ given $xH$, we first define $q_a \triangleq \Pr[xH = a]$ and $q_{b|a} \triangleq \Pr[xB=b \mid xH=a]$. Given that $H(xB|xH)=\delta$, a simple Markov inequality yields $\Pr_{a}[H(xB|xH=a)>\sqrt{\delta}] \leq \sqrt{\delta}$. When the conditional entropy is $<\sqrt{\delta}$, we have \[\Pr[xB = mode(xB \mid xH=a)] \geq 1 - \sqrt{\delta}.\] So there exists a function $f^{-1}:\{0,1\}^m\rightarrow\{0,1\}^n$ such that $f^{-1}(f(x))=x$ with probability at least $1-2\sqrt{\delta}$.
\textbf{Overview of the decoding algorithm.} We'll discuss the details of the decoding algorithm next lecture, but in broad strokes: we start with a collection of bits, each of which is 1 with probability $p$ and 0 with probability $1-p$. We take combinations of the bits to produce new variables, and mark the resulting positions as belonging to $A$, $B$, or $C$. For each bit we compute the probability that it is 1 or 0 given all of the bits above it. Because the matrix is polarized, these probabilities will be very close to 0 or 1, so we can round deterministically. In the next lecture, we'll show how to compute the probability for each bit efficiently.
%%%%%%%%
\section{Bibliographic Notes}
There are two core works to consider for Polar Codes: the initial work of Arikan, which lays out the proposal, and a second work of Guruswami and Xia, which shows decoding in time poly$(n/\epsilon)$.
\bibliographystyle{alpha}
\bibliography{bib}
\end{document}