\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{epsfig}
\usepackage{graphicx}
%\usepackage{mathabx}
\usepackage[small,nohug,heads=vee]{diagrams}
\diagramstyle[labelstyle=\scriptstyle]
\usepackage[fleqn,tbtags]{mathtools}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf 6.440: Essential Coding Theory } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
%\define
\newcommand{\eps}{\epsilon}
%\newcommand{\spn}{\text{span}}
\newcommand*{\pr}[1]{\mathop{Pr}\{#1\}}
%\DeclareMathOperator{\Pr}{Pr}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{note}[theorem]{Note}
\newtheorem{notation}[theorem]{Notation}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
%\renewcommand{\baselinestretch}{1.25}
\begin{document}
\lecture{2 --- February 11, 2013}{Spring 2013}{Prof.\ Madhu Sudan}{Mina Karzand}
\section{Advertisement}
The course 6.441 ``Information Theory'' is offered this spring by Prof. Yury Polyanskiy.
\section{Plan}
Today we will talk about Shannon's 1948 paper ``A Mathematical Theory of Communication'' and
\begin{itemize}
\item Coding theorems (source coding and channel coding)
\item Converse theorems
\end{itemize}
\section{Shannon's paper}
Shannon, in his 1948 paper, took a new perspective on the errors that may occur in communication. He also tried to formulate what \emph{reliable communication} should look like and how it should be mathematically modeled. To do so, he studied two extreme cases of communication channels:
\begin{itemize}
\item \emph{Noiseless Channel}: in which the goal of communication is to transmit the message using as few resources (bits) as possible. The basic idea behind source coding is \emph{compression}.
\item \emph{Noisy Channel}: in which, during the transmission of the message, the output of the channel is a noisy version of the input. The goal of channel coding is to introduce a redundant representation of the message such that reliable communication is possible even in the presence of noise.
\end{itemize}
He introduced two intermediate objects that help achieve the above goals: \emph{the encoder} at the transmitting side of the communication and \emph{the decoder} at the receiving side.
At the time Shannon wrote his paper, the computational power to perform the encoding and especially the decoding in a tractable way did not exist. Instead, he argued that these are mathematically well-defined objects.
\section{Noisy Channel Theorem}
\begin{diagram}
\fbox{\text{Source}} &\rTo^{m} & \fbox{\text{Encoder}} &\rTo^{X=E(m)} & \fbox{\text{Noisy Channel}} &\rTo^{y} & \fbox{\text{Decoder}}&\rTo^{\hat{m}=D(y)} &\fbox{\text{Receiver}} \\
\end{diagram}
The channel is modeled as an object whose output is a noisy version of its input. The probabilistic relationship between the output of the channel, $y$, and the input of the channel, $x$, is modeled as $P(y|x)$.
\emph{Discrete memoryless channels} are channels in which this relationship between the input and output letters is independent of time and of the past history of the communication.
\emph{Binary Symmetric Channels}, BSC($p$), are discrete memoryless channels with binary input and output alphabets in which each input bit is flipped with probability $p$ at the output.
\begin{diagram}
0 &\rTo^{1-p}& 0\\
&\rdTo^{p}\ruTo^{p}&\\
1 &\rTo_{1-p}&1\\
\end{diagram}
Shannon claimed that for any discrete memoryless channel (or, more generally, any ergodic channel) there exists a limiting rate of communication, the \emph{capacity}, which he determined exactly.
The capacity of BSC($p$) is $C_{\text{BSC}(p)}=1-H(p)$, where
\[H(p)\triangleq -p \log{p} - (1-p) \log{(1-p)} \]
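As a quick numerical illustration (not part of the original notes), the binary entropy function and the resulting BSC capacity can be computed as follows; the helper names \texttt{H} and \texttt{bsc\_capacity} are ours:

```python
import math

def H(p: float) -> float:
    """Binary entropy in bits, with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Shannon capacity of BSC(p): 1 - H(p) bits per channel use."""
    return 1 - H(p)

# Crossover probability p ~ 0.11 gives H(p) ~ 1/2, so capacity ~ 1/2.
```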
\section{Informal Statement of the Noisy Channel Coding Theorem}
For any communication channel, there is a sharp threshold called the capacity, $C$.
If you transmit information at a rate $R<C$, then $\Pr\{\hat{m} \neq m\}\leq \exp(-n)$; if you transmit at a rate $R>C$, then $\Pr\{\hat{m} = m\}\leq \exp(-n)$.
As described, if the rate of communication is below capacity, then the larger the block length, the smaller the probability of error at the receiver. The cost of achieving this smaller probability of error with longer blocks is that the encoding and especially the decoding become computationally heavier, and the delay to extract the message in each block also grows.
If the rate of communication is above capacity, nothing can be done at the encoder or the decoder to achieve reliable communication with a small probability of error.
\section{Informal Statement of the Noiseless Channel Coding Theorem}
Given a source of information which generates a sequence of bits, each bit being $0$ with probability $1-p$ and $1$ with probability $p$, we want to represent this sequence with as few bits as possible over a noiseless channel.
Suppose that we run this source $n$ times. We expect to see about $pn$ 1's in the generated sequence and $(1-p)n$ 0's.
One way to represent the message (the generated sequence) is a two-phase communication. In the first phase, the number of $1$'s in the sequence is transmitted. In the second phase, the string with the given number of $1$'s is identified.
\begin{enumerate}
\item Send $k=$ the number of $1$'s in the message.
\item Use $\lceil \log_{2}{n \choose k}\rceil$ more bits to identify which of the strings of length $n$ with $k$ 1's is the message.
\end{enumerate}
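A minimal Python sketch of this two-phase scheme (our own code; the combinatorial-number-system ranking is one standard way to index the ${n \choose k}$ strings in phase 2):

```python
from math import comb, ceil, log2

def rank(bits):
    """Lexicographic index of `bits` among all strings with the same
    length and the same number of 1's (combinatorial number system)."""
    n, k, r = len(bits), sum(bits), 0
    for i, b in enumerate(bits):
        if b == 1:
            # All strings agreeing on the prefix that put a 0 here come
            # first; they choose the remaining k ones from later positions.
            r += comb(n - i - 1, k)
            k -= 1
    return r

def two_phase_encode(bits):
    """Phase 1: send k, the number of 1's. Phase 2: send the index of the
    string among the C(n, k) strings of that weight (ceil(log2 C(n,k)) bits)."""
    n, k = len(bits), sum(bits)
    return k, rank(bits), ceil(log2(comb(n, k)))
```

For example, among the six length-4 strings of weight 2, `0011` has rank 0 and `1100` has rank 5, and phase 2 costs $\lceil \log_2 6\rceil = 3$ bits.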
\begin{theorem}(Chernoff Bound)
In a sequence of length $n$ where each element is $0$ with probability $1-p$ and $1$ with probability $p$, if $k$ is the number of $1$'s in the sequence, then
\[\pr{k \notin [(p-\eps)n,(p+\eps)n]}\leq \exp(-n)\]
\end{theorem}
The generated sequence from the specified source, with $k$ $1$'s in it, falls into one of two cases:
\begin{itemize}
\item $k\in [(p-\eps)n,(p+\eps)n]$: According to the Chernoff bound, this happens with probability greater than $1-\exp(-n)$. We want to know how ${n \choose k}$ behaves in this case. For simplicity, let us assume $k=pn$. Thus,
\[{n \choose pn} \approx \frac{n^n}{(pn)^{pn} [(1-p)n]^{(1-p)n}}=
\left(\frac{1}{p^p (1-p)^{1-p}}\right)^n\]
\[\log{n \choose pn}\approx n\left[ -p\log{p}-(1-p)\log{(1-p)} \right]= n H(p)\]
Thus, if the message falls into this category (which happens with probability $> 1-\exp(-n)$), the number of bits required to transmit the message over the noiseless channel is $\approx nH(p)(1\pm \eps')$, where $\eps' \to 0$ as $\eps \to 0$.
\item $k\notin [(p-\eps)n,(p+\eps)n]$: This event happens with probability $< \exp(-n)$, and in this case the number of bits required to transmit the message is $\approx \log{n \choose k}\leq n$.
\end{itemize}
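The approximation $\log{n \choose pn}\approx nH(p)$ above can be checked numerically; Python's arbitrary-precision integers make the binomial coefficient exact (a sketch, with our own variable names):

```python
from math import comb, log2

def H(p):
    """Binary entropy in bits, for p strictly between 0 and 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, p = 10_000, 0.3
# (1/n) log2 C(n, pn), computed exactly; within O(log(n)/n) of H(0.3) ~ 0.8813
exact = log2(comb(n, int(p * n))) / n
```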
Thus, the expected length of the block of bits required to transmit the message generated by this source is $\leq nH(p)(1\pm \eps')+n \exp(-n)\leq nH(p)(1\pm \eps'')$.
The converse part of the coding theorem for the noiseless channel says that one cannot do better: the message cannot be transmitted reliably, with high probability, using fewer bits.
The basic idea for proving the converse is that for each $k\in [(p-\eps)n,(p+\eps)n]$, there are ${n \choose k}$ equiprobable strings with the same number of $1$'s. By a pigeonhole argument, at least $\log{n \choose k}\approx nH(p)$ bits are required to distinguish these strings.
The interesting thing about this procedure is that the coding does not need to know the actual value of $p$, even though we used this value in the analysis of the optimal rate and the expected block length. This implies that a coding (compression) algorithm can be built oblivious to the parameters of the source or the distribution of the messages it generates. This idea leads to universal source coding techniques such as Lempel-Ziv-Welch coding.
\section{Formal Statement of the Noisy Channel Coding Theorem}
The capacity of the binary symmetric channel with probability of error $p$ is $C_{\text{BSC}(p)}=1-H(p)$.
In the case $p=1/2$, it is intuitive that $C=0$, meaning that no information can be transmitted reliably: each input symbol is flipped or not with equal probability, so the output is independent of the input.
\begin{diagram}
0 &\rTo^{1/2}& 0\\
&\rdTo^{1/2}\ruTo^{1/2}&\\
1 &\rTo_{1/2}&1\\
\end{diagram}
In binary symmetric channels with probability of error $p=1/2-\eps$ for very small $\eps$, we can approximate $C=\Theta(\eps^2)$. Thus the theory says that even though each input bit is flipped with probability close to $1/2$, the capacity is still nonzero and reliable communication is possible at a nonzero rate.
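The $\Theta(\eps^2)$ behavior comes from the Taylor expansion $1-H(1/2-\eps)=\frac{2}{\ln 2}\eps^2+O(\eps^4)$, which a quick numerical check confirms (our addition, not from the original notes):

```python
from math import log, log2

def H(p):
    """Binary entropy in bits, for p strictly between 0 and 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

for eps in (0.1, 0.01, 0.001):
    cap = 1 - H(0.5 - eps)
    taylor = (2 / log(2)) * eps ** 2   # leading term of the expansion
    print(f"eps={eps}: capacity={cap:.3e}, (2/ln 2) eps^2={taylor:.3e}")
```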
Shannon suggested that a coding scheme achieving this rate is random coding: the codeword for each message is chosen completely at random. He used the probabilistic method to prove the achievability result for this coding scheme.
\subsection{Encoding of BSC($p$)}
Fix any $\eps>0$ and let $R=1-H(p)-\eps$ and $k=Rn$. The encoder function $E:\{0,1\}^k \to \{0,1\}^n$ is chosen at random. The optimal decoder performs \emph{maximum likelihood decoding}: given a received string $y\in \{0,1\}^n$, the ML decoder computes how far this string is from each possible transmitted codeword and chooses the closest one as the estimate of the transmitted message.
For our proof, a simpler decoder suffices. Given the received string $y$, our proposed decoder generates the set $S_y=\{\tilde{m} \mid \delta(E(\tilde{m}),y)<(p+\eps/2)n \}$. If $|S_y|=1$, then the estimated message is the unique $\hat{m}\in S_y$; otherwise it declares an error in the transmission. We can prove that the probability of error for a source with equiprobable messages, a completely random encoder, and the specified decoder can be made arbitrarily small. It is also assumed that the decoder has complete knowledge of the encoder function $E$.
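The random-encoder/threshold-decoder pair described above can be simulated directly. The following Monte Carlo sketch (our own code, with parameters chosen so the rate $k/n$ is well below capacity) estimates the error probability:

```python
import random

def simulate(n=100, k=5, p=0.05, eps=0.1, trials=200, seed=1):
    """Estimate the error probability of a uniformly random code with the
    threshold decoder S_y = {m' : dist(E(m'), y) < (p + eps/2) n}."""
    rng = random.Random(seed)
    thresh = (p + eps / 2) * n
    errors = 0
    for _ in range(trials):
        # Fresh random encoder: 2^k codewords, each uniform over {0,1}^n.
        code = [[rng.randint(0, 1) for _ in range(n)] for _ in range(2 ** k)]
        m = rng.randrange(2 ** k)
        # Pass E(m) through BSC(p): flip each bit independently w.p. p.
        y = [b ^ (rng.random() < p) for b in code[m]]
        S = [m2 for m2, c in enumerate(code)
             if sum(a != b for a, b in zip(c, y)) < thresh]
        if S != [m]:      # the decoder succeeds only when S_y = {m}
            errors += 1
    return errors / trials
```

Here $R=k/n=0.05$ is far below $1-H(0.05)\approx 0.71$, so the estimated error probability is small; pushing $R$ toward capacity (or shrinking $n$) makes it grow.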
\begin{diagram}
\fbox{\text{Source}} &\rTo^{m} & \fbox{\text{Encoder}} &\rTo^{X=E(m)} & \fbox{\text{Noisy Channel}} &\rTo^{y} & \fbox{\text{Decoder}}&\rTo^{\hat{m}=D(y)} &\fbox{\text{Receiver}} \\
& & & &\uTo^{\eta}& & & \\
\end{diagram}
The proof of the theorem uses the following lemma.
\begin{lemma}
$\forall \eps>0$, assuming a uniform distribution over the messages, a completely random encoder, and $\eta$ being the error sequence of the BSC($p$):
\[\Pr_{E,m,\eta}\{D(E(m)+\eta)\neq m\}\leq \exp(-n)\]
\end{lemma}
In the decoding process, there are three types of events that could result in a decoding error.
\begin{enumerate}
\item $|S_y|=0$
\item $|S_y|\geq 2$
\item $m \notin S_y$
\end{enumerate}
The proof of the theorem consists of three parts:
\begin{itemize}
\item[(L1)] We use Chernoff bound and the fact that the expected number of 1's in the error sequence ($\eta$) is $pn$ to claim
\[\pr{m \notin S_y}\leq \exp(-n)\]
\item[(L2a)] For a given received sequence $y$ and a fixed $\hat{m}\neq m$,
\begin{eqnarray*}
\Pr_{E(\hat{m})}\{\hat{m}\in S_y\} & \leq & \frac{\sum_{i=0}^{(p+\eps/2)n}{n \choose i}}{2^n}\\
& \approx & \frac{2^{nH(p)(1+\eps')}}{2^n}\\
& \approx & 2^{-n[1-H(p)-\eps'']}
\end{eqnarray*}
\item[(L2)] The union bound is used to prove
\begin{eqnarray*}
\pr{\exists \hat{m}\neq m, \hat{m}\in S_y} &\leq& 2^k \, 2^{-n[1-H(p)-\eps'']}\\
& = & 2^{n[1-H(p)-\eps]} \, 2^{-n[1-H(p)-\eps'']}\\
& = & 2^{-(\eps-\eps'')n} \;=\; 2^{-\eps''' n}
\end{eqnarray*}
\end{itemize}
\begin{theorem}
$\forall p\in (0,1/2)$, $\forall \eps>0$, $\exists \delta>0$, $n_0<\infty$ s.t.\ $\forall n>n_0$
there exist $E:\{0,1\}^k \to \{0,1\}^n$, where $k\geq n(1-H(p)-\eps)$, and $D:\{0,1\}^n \to \{0,1\}^k$ such that
\[\Pr_{m,\eta\in BSC(p)}\{D(E(m)+\eta)\neq m\}\leq 2^{-\delta n}\]
\end{theorem}
Note that the ball of radius $pn$ around any point has size $\approx 2^{nH(p)}$.
\begin{theorem}[Converse]
$\forall p, \forall \eps>0, \exists\delta>0, n_0<\infty$ s.t. $\forall n\geq n_0$ and $\forall E:\{0,1\}^k\to\{0,1\}^n ,D:\{0,1\}^n\to \{0,1\}^k$ where $k=\lfloor (1-H(p)+\eps)n\rfloor$
\[\Pr_{m,\eta}\{D(E(m)+\eta)=m\}\leq 2^{-\delta n}\]
\end{theorem}
\begin{diagram}
&\rTo^{m} & \fbox{\text{Encoder}} &\rTo^{X=E(m)} &\fbox{+}&\rTo^{y=E(m)+\eta}& \fbox{\text{Noiseless Channel}} &\rTo^{y} & \fbox{\text{Decoder}}&\rTo^{\hat{m}=D(y)}_{\hat{\eta}=y-E(\hat{m})} \\
& & & &\uTo^{\eta}& & & \\
\end{diagram}
If we can recover $\hat{m}=m$, then we also recover $\hat{\eta}=\eta$. By the source coding theorem, transmitting $\eta$ over the noiseless channel requires $H(p)$ bits/symbol. Thus, if we transmit information at a rate $R>1-H(p)$, we would be expecting the noiseless channel to carry information at a rate above $1$ bit/symbol, which is impossible.
The combinatorics way of looking at the converse theorem:
We model the input-output relationship using a bipartite graph.
There are $2^k$ messages, which is the number of nodes on the right side of the graph.
The nodes on the left side of the graph correspond to the received sequences. Having transmitted a message, if a typical error happens, a subset of the output sequences can be received. A typical error has roughly $pn$ $1$'s in it. Thus, considering only typical errors, the degree of each message node (on the right) is $D=2^{nH(p)}$, because there are roughly ${n \choose np}\approx 2^{nH(p)}$ sequences corresponding to typical errors. The decoding procedure hopes to recover the message from the noisy received signal. Thus, if $D2^k\gg 2^n$, the decoding procedure fails. Decoding correctly is possible if the degree of the received-sequence nodes (on the left) is usually $1$; the average degree of the nodes on the left side is $D2^{k-n}$.
\[\Pr\{\text{a random edge lands on a vertex of degree} < \tau \cdot D2^{k-n}\}<\tau\]
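The last display is an instance of a general edge-counting fact; a short derivation (our addition), writing $\bar{d}$ for the average degree on the received-sequence side: there are at most $2^n$ received-sequence vertices, and each vertex of degree less than $\tau\bar{d}$ is incident to fewer than $\tau\bar{d}$ edges, so

```latex
\[
\#\{\text{edges incident to vertices of degree} < \tau\bar{d}\}
  \;<\; 2^n \cdot \tau\bar{d} \;=\; \tau \cdot D2^k
  \;=\; \tau\cdot\#\{\text{edges}\}.
\]
```

Dividing by the total number of edges $D2^k$ gives the stated bound.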
\end{document}