\documentclass[10pt]{article}
\usepackage{amsfonts,amsthm,amsmath,amssymb}
\usepackage{array}
\usepackage{epsfig}
\usepackage{fullpage}
\usepackage{float}
\usepackage[colorlinks = false]{hyperref}
\newcommand{\1}{\mathbbm{1}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\newcommand{\x}{\times}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\F}{\mathbb{F}}
\newcommand{\E}{\mathop{\mathbb{E}}}
\renewcommand{\bar}{\overline}
\renewcommand{\epsilon}{\varepsilon}
\newcommand{\eps}{\varepsilon}
\newcommand{\DTIME}{\textbf{DTIME}}
\renewcommand{\P}{\textbf{P}}
\newcommand{\SPACE}{\textbf{SPACE}}
\begin{document}
\input{preamble.tex}
\newtheorem{example}[theorem]{Example}
\theoremstyle{definition}
\newtheorem{defn}[theorem]{Definition}
\handout{CS 229r Essential Coding Theory}{April 29, 2020}{Instructor:
Madhu Sudan}{Scribe: Aditya Dhar}{Lecture 25}
\section{Outline}
\begin{itemize}
\item Coding for editing errors
\item Recap of course
\end{itemize}
\section{Codes for Editing Errors}
\subsection{Definitions} We have two strings $X \in \Sigma^n$ and $Y \in \Sigma^m$.
\begin{defn}[$X \to_{\Delta, \Gamma} Y$]
Deleting at most $\Delta$ symbols from $X$ and inserting at most $\Gamma$ symbols transforms $X$ into $Y$.
\end{defn}
We want to design a $(\Delta, \Gamma)$-edit-distance code:
\begin{itemize}
\item Unique decoding: $D(Y) = X$
\begin{itemize}
\item Encoding function, $E$, maps $\Sigma^k \to \Sigma^n$
\item Decoding function, $D$, maps $\Sigma^* \to \Sigma^k$, unlike the decoders we have seen so far, which map from $\Sigma^n$.
\end{itemize}
\item List decoding: $X \in D(Y)$.
\begin{itemize}
\item Use the same encoding function, $E$, maps $\Sigma^k \to \Sigma^n$
\item Use a different decoding function, $D$, mapping $\Sigma^* \to \binom{\Sigma^k}{L}$. Here we do not ask the decoder to output just the message; it returns a small list that includes the desired message.
\end{itemize}
\end{itemize}
\noindent The guarantee is: $\forall X \in \Sigma^k$ and $\forall Y \in \Sigma^*$ such that $E(X)\to_{\Delta, \Gamma} Y$, decoding $Y$ recovers $X$ (uniquely, or as a member of the output list). \\
\noindent Thus, a family of codes $E = (E_n: \Sigma^{k_n} \to \Sigma^n)_n$ with rate $R = \lim_{n\to\infty}\frac{k_n}{n}$ is a $(\delta, \gamma)$-code if $E_n$ is a $(\delta n, \gamma n)$-edit-distance code for all $n$.
\noindent Question: what values of $\delta, \gamma, R$ are achievable?
\subsection{Background}
Schulman and Zuckerman '99 first proposed an insertion/deletion code over an alphabet of size $q = \mathrm{poly}(n)$ achieving rate $R \to 1-(\delta + \gamma)$. The proof was algorithmic and gave unique decoding. As an initial thought experiment to see why the rate is bounded by $1-(\delta + \gamma)$, we examine a pigeonhole-style argument:
\begin{itemize}
\item Take a code of rate exceeding $1-(\delta+\gamma)$. The usual pigeonhole argument says there are two codewords that agree on their first $(1-(\delta+\gamma))n$ coordinates, while the remaining coordinates may be completely different.
\item Produce an intermediate word by copying the shared prefix, deleting $\delta n$ of the suffix coordinates of the first codeword, and inserting $\gamma n$ suffix coordinates of the second. The same word is reachable from the second codeword by inserting $\gamma n$ coordinates of the first suffix and deleting $\delta n$ of its own. Either way we use only $\delta n$ deletions and $\gamma n$ insertions, so the two codewords are confusable and the rate bound follows.
\item The construction achieving this bound, on the other hand, only works when $q$ is very large, giving the $\mathrm{poly}(n)$ alphabet size.
\end{itemize}
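\noindent To make this sketch concrete (the substring notation here is mine, not from lecture): if $R > 1-(\delta+\gamma)$, there are more codewords ($q^{Rn}$) than prefixes of length $(1-(\delta+\gamma))n$, so two codewords $c_1 = p\,u$ and $c_2 = p\,v$ share such a prefix $p$, with $|u| = |v| = (\delta+\gamma)n$. The intermediate word
\[ Y \;=\; p \cdot u_{[1..\gamma n]} \cdot v_{[\delta n + 1..(\delta+\gamma)n]} \]
is reachable from $c_1$ (delete the last $\delta n$ symbols of $u$, insert the last $\gamma n$ symbols of $v$) and from $c_2$ (insert the first $\gamma n$ symbols of $u$, delete the first $\delta n$ symbols of $v$). So a $(\delta n, \gamma n)$-edit adversary can confuse $c_1$ with $c_2$, and unique decoding is impossible at any rate above $1-(\delta+\gamma)$.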
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{intermediate.png}
\caption{Construction of the intermediate codeword}
\end{figure}
\noindent How was the actual code constructed? Take a Hamming-metric code $E: \Sigma^k \to \Sigma^n$ and build from it $E':\Sigma^k \to (\Sigma \times [n])^n$:
\begin{itemize}
\item Define $E'(x)_i = (E(x)_i, i)$
\item Rate$(E')$ = Rate$(E) - \frac{\log n}{\log |\Sigma|}$ for $n \ll |\Sigma|$
\end{itemize}
\noindent So, for some Hamming-metric code, we construct an edit code $E'$ in which the encoding of $x$ at the $i$-th location is a pair that includes the index $i$. The alphabet is larger, but carrying the index is exactly what we want for editing errors: each surviving symbol reveals its original location. A deletion in $E'$ becomes an erasure in $E$, and a same-location deletion/insertion (the adversary deletes the coordinate at location $i$ and then inserts another coordinate at location $i$) becomes a single Hamming error in $E$. So if we can handle the resulting mix of erasures and errors with a code $E$ of distance $\gamma + \delta$, we can construct a $(\delta, \gamma)$-edit-distance code. The best code of distance $\gamma + \delta$ is a Reed-Solomon code, with rate $1-(\delta + \gamma)$, so the resulting edit code has rate arbitrarily close to $1-(\delta+\gamma)$, losing only a little in the logarithmic terms.
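As a toy illustration of the indexing trick (the function names are mine, and the second function only translates edits into an erasure pattern; it is not the actual Schulman-Zuckerman decoder):

```python
# Sketch of the indexing trick: pair each symbol of the inner codeword
# with its position, so edits on E' become erasures/errors on E.

def index_encode(codeword):
    """E'(x)_i = (E(x)_i, i): attach the position i to each symbol."""
    return [(sym, i) for i, sym in enumerate(codeword)]

def edits_to_erasures(received, n):
    """Map surviving (symbol, index) pairs back to a length-n word for
    the inner decoder: position i is recovered if exactly one surviving
    pair claims index i, and treated as an erasure (None) otherwise."""
    claims = {}
    for sym, i in received:
        claims.setdefault(i, set()).add(sym)
    word = []
    for i in range(n):
        syms = claims.get(i, set())
        word.append(syms.pop() if len(syms) == 1 else None)
    return word

# Deleting the pair at position 2 leaves an erasure at coordinate 2:
word = index_encode("abcd")
received = word[:2] + word[3:]
# edits_to_erasures(received, 4) -> ['a', 'b', None, 'd']
```

The inner Hamming-metric decoder then corrects the erasures (and any same-location substitutions) as usual.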
\subsection{Recent Work}
Schulman-Zuckerman only works when $q$ is very large. We now want constant-$q$ decoding, where $q = O(1)$. This does not mean that $q$ is binary, but that it is a (possibly very large) constant. To do this, we need something similar to indexing, but without an alphabet growing with $n$.
\begin{itemize}
\item \textit{Haeupler-Shahrasbi '17:} For $q = O(1)$, $R \to 1-(\delta+\gamma)$. This is the same rate as Schulman and Zuckerman, again for unique decoding. The scheme effectively simulates a Hamming error (deletion of a character plus insertion in the same location), so HS '17 implies $R = 1-2\delta$ in that regime, subsuming what we know for Hamming errors and approaching the Singleton bound.
\item \textit{Haeupler-Shahrasbi-Sudan '18:} For $q = O_{\gamma, \delta, R}(1)$, $R\to 1-\delta$. There is no restriction on $\gamma$: while the list size may grow with $\gamma$ as the adversary inserts a significant number of symbols, we can still output a list containing the original message, despite large $\gamma$! This work subsumes what we know for list decoding, and is what we saw in a previous problem set.
\end{itemize}
Haeupler-Shahrasbi strategy:
\begin{itemize}
\item Synchronization string $S = (S_1, \dots, S_n) \in [c]^n$ over a constant-size alphabet, $c = O(1) \ll n$
\item We still have $E: \Sigma^k \to \Sigma^n$, but we construct off of this an edit-code $E': \Sigma^k \to (\Sigma \times [c])^n$
\item Our new edit code is constructed as $E'(x)_i = (E(x)_i, S_i)$
\item Rate$(E')$ = Rate$(E) - \frac{\log c}{\log |\Sigma|}$
\end{itemize}
\noindent This is fairly similar to the original Schulman-Zuckerman scheme, except we no longer have $i\neq j \implies S_i \neq S_j$; by the pigeonhole principle, we cannot hope for that with $c < n$. Instead, we define properties of $S$ that are useful, presuming that $S$ is constructible (a tamer task than constructing an entire error-correcting code, since $S$ is a single string rather than $2^k$ codewords).
\begin{defn}[Synchronization Strings]
A string $S$ is an $\epsilon$-synchronization string if every $S$-matching $M$ satisfies $|M| \leq \epsilon n$. Here $M = \{(i_t, j_t) \mid 1 \leq t \leq m\} \subseteq [n] \times [n]$ is an $S$-matching if
\begin{itemize}
\item the indices $i_1, \dots, i_m$ are distinct, as are $j_1, \dots, j_m$;
\item it is nontrivially $S$-valid: $S_{i_t} = S_{j_t}$ but $i_t \neq j_t$ for all $t$, so there is no vertical (trivial) matching, and each element is always matched to an equal letter;
\item it is monotone: $i_a < i_b \implies j_a < j_b$, maintaining the relative order of the matched elements.
\end{itemize}
\end{defn}
\begin{figure}[H]
\centering
\includegraphics[width=.7\textwidth]{monotone.png}
\caption{An example of a matching preserving monotonicity with no trivial matching.}
\end{figure}
\noindent We can verify whether $S$ is $\epsilon$-synchronizing in polynomial time, using dynamic programming to find the largest matching. Moreover, for all $\epsilon > 0$ there exists $c < \infty$ such that for all $n$ there exists an $\epsilon$-synchronization string $S \in [c]^n$; these are what yield constant-alphabet edit-distance codes that are rate optimal. The relevant question is then how such strings help in decoding.
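A sketch of that dynamic program (naming is mine; it assumes the convention that the $i_t$'s are distinct and the $j_t$'s are distinct): the largest $S$-matching is just a longest common subsequence of $S$ with itself in which the diagonal pairs $i = j$ are forbidden.

```python
def max_self_matching(S):
    """Size of the largest monotone S-matching avoiding the trivial
    pairs i == j: an LCS of S against itself, diagonal forbidden."""
    n = len(S)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if i != j and S[i - 1] == S[j - 1]:
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return dp[n][n]

def is_eps_synch(S, eps):
    """S is an eps-synchronization string iff no S-matching exceeds eps*n."""
    return max_self_matching(S) <= eps * len(S)

# [0,1,0,1] admits the matching (1,3),(2,4), so max_self_matching is 2;
# a string of distinct symbols admits no nontrivial matching at all.
```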
\begin{exercise}
Prove that a random string is $\epsilon$-synch. w.h.p.
\end{exercise}
\section{List decoding + Synchronization Strings}
\subsection{Definitions}
\begin{itemize}
\item List decodability: $C \subseteq \Sigma^n$ is $(\delta, L)$-list-decodable if
\begin{itemize}
\item for every $y = (y_1, \dots, y_n) \in \Sigma^n$, the set $S = \{x \in C \mid \#\{i \mid x_i \neq y_i\} \leq \delta n\}$ satisfies $|S| \leq L$, and $S$ can be computed efficiently given $y$.
\end{itemize}
\begin{exercise} Prove that for any $q$ and $\gamma < q-1$, the family of codes with rate $R < 1- \log_q(\gamma + 1) - \gamma \log_q\frac{\gamma+1}{\gamma} - \frac{\gamma +1}{l+1}$ is list-decodable with a list of size $l$ from any $\gamma n$ insertions w.h.p.
\end{exercise}
\item List-recoverable codes:
\begin{itemize}
\item $C \subseteq \Sigma^n$ is $(l, \delta, L)$-list-recoverable if: for every collection $Y_1, \dots, Y_n \subseteq \Sigma$ with $|Y_i| \leq l$, the set $S = \{x \in C \mid \#\{i \mid x_i \notin Y_i\} \leq \delta n\}$ has size at most $L$.
\item A starting point for constructing this is a good list-decodable code, except instead of being given a single symbol for each coordinate, we are given a set of values; the question is to find the codewords that pass through many of these sets, given that we miss a $\delta$-fraction (errors).
\end{itemize}
\begin{theorem}[Guruswami-Rudra '06]
$\forall l, \delta, \epsilon > 0$, $\exists \Sigma, L$ such that there exists a family of $(l, \delta, L)$-list-recoverable codes of rate $1-\delta-\epsilon$.
\end{theorem}
\item So, regardless of how large $l$ is, we can make a list-recoverable code of rate $1- \delta -\epsilon$; possible starting points are Folded RS codes or the Guruswami-Indyk alphabet reduction.
\end{itemize}
\subsection{Edit Errors}
\begin{theorem} Let $E: \Sigma^k \to \Sigma^n$ be $(l, \delta + \epsilon, L)$-list-recoverable, and let $S \in [c]^n$ be an $\epsilon'$-synchronization string, where $l = \frac{2(1+\gamma)}{\epsilon}$ and $\epsilon' = \frac{\epsilon}{2l}$. Then $E': \Sigma^k \to (\Sigma \times [c])^n$ given by $E'(x)_i = (E(x)_i, S_i)$ is a $(\delta, \gamma)$-list-decodable code.
\end{theorem}
We give an algorithm to prove that this works. Given $(a_j, b_j)_{j \in [m]}$ with $a_j \in \Sigma$ and $b_j \in [c]$:
\begin{itemize}
\item Set $B = (b_1, \dots, b_m)$ and $Y_1 = \dots = Y_n = \varnothing$. We will repeatedly find the largest monotone matching (a matching without crossovers) between $B$ and $S = (s_1, \dots, s_n)$; this is solvable in polynomial time, since it is just the longest common subsequence problem.
\item For $l$ iterations, do:
\begin{itemize}
\item Let $M$ be the largest monotone matching between $B$ and $S$.
\item Remove the matched part from $B$, and add the matched $a$-symbols into the sets $Y_i$: if $b_j \leftrightarrow S_i$, then $Y_i \leftarrow Y_i \cup \{a_j\}$.
\textit{Note:} At each iteration, at most one element is added to each $Y_i$, because each $S_i$ is matched to at most one $b_j$ per iteration, and matched $b_j$'s are deleted and cannot be matched again.
\end{itemize}
\item Finally, list recover from $Y_1,\dots, Y_n$. These are subsets of $\Sigma$ of size at most $l$, because we add at most one element per iteration over $l$ iterations.
\end{itemize}
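The loop above can be sketched in code as follows (names are mine; the matching step is plain longest common subsequence with backtracking):

```python
def lcs_matching(B, S):
    """Largest monotone matching between B and S, as (j, i) pairs."""
    m, n = len(B), len(S)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m - 1, -1, -1):          # dp[j][i] = LCS of B[j:], S[i:]
        for i in range(n - 1, -1, -1):
            if B[j] == S[i]:
                dp[j][i] = 1 + dp[j + 1][i + 1]
            else:
                dp[j][i] = max(dp[j + 1][i], dp[j][i + 1])
    pairs, j, i = [], 0, 0                  # backtrack one optimal matching
    while j < m and i < n:
        if B[j] == S[i] and dp[j][i] == 1 + dp[j + 1][i + 1]:
            pairs.append((j, i))
            j, i = j + 1, i + 1
        elif dp[j + 1][i] >= dp[j][i + 1]:
            j += 1
        else:
            i += 1
    return pairs

def recover_sets(received, S, l):
    """received: list of (a_j, b_j) pairs; S: sync string; l: iterations.
    Returns the candidate sets Y_1..Y_n (0-indexed) for list recovery."""
    pairs = list(received)                  # working copy of (a, b) pairs
    Y = [set() for _ in range(len(S))]
    for _ in range(l):
        B = [b for (_, b) in pairs]
        M = lcs_matching(B, S)
        if not M:
            break
        for (j, i) in M:
            Y[i].add(pairs[j][0])           # matched a-symbol joins Y_i
        matched = {j for (j, _) in M}       # delete the matched part of B
        pairs = [p for j, p in enumerate(pairs) if j not in matched]
    return Y
```

An outer list-recovery decoder for $E$ would then be run on the sets $Y_1, \dots, Y_n$; it is not included here.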
\subsection{Analysis of Algorithm}
Say that we have transmitted $E'(x)$ and received the pairs $(a_j, b_j)_{j \in [m]}$. If we can ensure that the symbol $E(x)_i$ fails to appear in $Y_i$ for at most a $(\delta + \epsilon)$-fraction of the coordinates $i$, list recovery finishes the job.
\begin{itemize}
\item There can be at most $\delta n$ failures from deletions: if the adversary deleted $\delta n$ coordinates of $E'(x)$, then for those coordinates $i$ the symbol $E(x)_i$ may never appear in $Y_i$, contributing at most a $\delta$-fraction of bad coordinates.
\item The other failures happen when $E'(x)_i = (a_j, b_j)$ is not deleted, but still $E(x)_i \notin Y_i$. There are two possible cases:
\begin{itemize}
\item Case 1: $b_j$ is matched to $S_{i'}$ for some $i' \neq i$:
\begin{itemize}
\item Because $S$ is an $\epsilon'$-synchronization string, there are at most $\epsilon' n$ such mismatches per iteration: the pairs $(i, i')$ with $S_i = S_{i'}$ and $i \neq i'$ arising from a monotone matching themselves form an $S$-matching, which has size at most $\epsilon' n$.
\item Over $l$ iterations, this gives at most $l\epsilon' n = \frac{\epsilon n}{2}$ such failures.
\end{itemize}
\item Case 2: $b_j$ remains unmatched at the end:
\begin{itemize}
\item Say that $\alpha n$ symbols of $B$ coming from undeleted codeword positions remain unmatched at the end. These symbols still align monotonically with $S$ (the undeleted positions of $E'(x)$ match $S$ perfectly, in order), so at every iteration the largest monotone matching has size at least $\alpha n$. Each iteration removes its matched symbols from $B$, so over $l$ iterations $l \alpha n \leq m \leq (1+\gamma)n$, giving $\alpha n \leq \frac{(1+\gamma)n}{l} = \frac{\epsilon n}{2}$.
\end{itemize}
\end{itemize}
This gives us at most $\epsilon n$ failures beyond the $\delta n$ from deletions, so at most a $(\delta + \epsilon)$-fraction of the coordinates are corrupted, and we can then apply list recovery.
\end{itemize}
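\noindent Collecting the three cases, with the parameter choices $l = \frac{2(1+\gamma)}{\epsilon}$ and $\epsilon' = \frac{\epsilon}{2l}$:
\[ \underbrace{\delta n}_{\text{deletions}} \;+\; \underbrace{l \epsilon' n}_{\text{Case 1}} \;+\; \underbrace{\tfrac{(1+\gamma)n}{l}}_{\text{Case 2}} \;\leq\; \delta n + \frac{\epsilon n}{2} + \frac{\epsilon n}{2} \;=\; (\delta + \epsilon)n, \]
so at most a $(\delta+\epsilon)$-fraction of coordinates $i$ have $E(x)_i \notin Y_i$, which is exactly what the $(l, \delta+\epsilon, L)$-list-recovery guarantee of $E$ requires.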
Further work being done on this problem includes editing and interaction errors, binary codes for edit distance (see Guruswami, Haeupler, Shahrasbi '19), and coding on small alphabets rather than large ones.
\section{Course Recap!}
\textbf{What did we do?}
\begin{itemize}
\item Existence and limits of codes:
\begin{itemize}
\item Greedy and random constructions, treated as largely interchangeable.
\item How do you prove that codes of a certain type don't exist? Packing bounds (the Hamming balls around codewords must not touch), list-packing bounds (the balls may overlap, but not by much), and embedding into Euclidean space, which shows that codes of large distance cannot have too many codewords. The Euclidean embedding is especially useful because it lets us bring analysis over real vectors to bear on Hamming space.
\end{itemize}
\item Constructions + Algorithms:
\begin{itemize}
\item Algebraic codes: these have very nice combinatorial properties that are mostly better than random. The exception is Reed-Muller, which seems to do worse than random codes; however, RM codes are locally decodable and testable, while random codes are not.
\item Graph-theoretic and information-theoretic (polar) codes, where the motivation is algorithmic: they achieve decoding efficiency that the other codes do not.
\end{itemize}
\item Advanced topics
\begin{itemize}
\item Local decoding
\item Interactive coding: rarely used in practice, since most interactive protocols have a constant number of rounds of interaction, in which case you might as well encode each round of communication separately
\item Edit Errors - today's lecture!
\end{itemize}
\item Applications in complexity:
\begin{itemize}
\item Short coverage of topics in `baby' pseudorandomness (limited independence and $\epsilon$-bias) and `baby' cryptography (hardcore bits)
\end{itemize}
\end{itemize}
\textbf{What didn't we do:}
\begin{itemize}
\item Bounds: didn't cover the linear programming bounds, which tell us that Chernoff-style bounds are largely optimal
\item Codes: Didn't give explicit constructions for algebraic-geometric codes
\item Algorithms:
\begin{itemize}
\item Low-density parity-check codes and their decoding. Luckily, these are largely dominated by polar codes, so there are few nice theorems we would have learned by covering LDPC codes.
\item Linear time machinery (see Spielman codes, Guruswami and Indyk '03 for further reference)
\end{itemize}
\item Explicit graph construction - despite using many for samplers, expanders, etc.
\item "Modern" topics, such as network coding and coding for network coding, quantum error correction
\item Lots of applications: hashing, Shamir secret sharing, group testing, streaming and data structures, pseudorandom generators, and probabilistically checkable proofs.
\end{itemize}
Thanks to Madhu and Chi-Ning for a great semester!
\end{document}