\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\DeclareMathOperator*{\E}{\mathbb{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbb{P}}
\newcommand{\eps}{\varepsilon}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf CS 229r: Algorithms for Big Data } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\begin{document}
\lecture{8 --- Sept. 26, 2013}{Fall 2013}{Prof.\ Jelani Nelson}{Anudhyan Boral}
\section{Overview}
In this lecture we begin a new topic: space lower bounds for streaming algorithms. Our main tool is going to be communication complexity. Some examples of such space lower bounds are:
\begin{itemize}
\item We need $\Omega(\varepsilon^{-2} + \log n)$ bits of space for the $F_0$ problem in the streaming model.
\item Computing the exact median deterministically requires $\Omega(n)$ space.
\item 2-approximation of $F_p$ requires $\Omega(n^{1 - \frac{2}{p}})$ bits. (Hence for $p > 2$ we must use at least polynomial space instead of polylogarithmic).
\end{itemize}
\section{Communication Complexity}
Consider a (cooperative) game between Alice and Bob. Alice is given an element $x$ from some large set $X$. Bob is given an element $y$ from some large set $Y$. Alice and Bob want to compute a function $f : X \times Y \rightarrow \{0,1\}$ by collaborating with each other, while minimizing the number of bits of communication between them.
Depending on the number of rounds allowed, Alice and Bob take turns passing messages to each other. First, Alice sends Bob a message $m_0$. Then, Bob sends Alice a message $m_1$, and so on. A communication game with $r$ rounds involves the sending of $r$ messages.
There are several variants of the communication game. They differ, for instance, in which party must output the value of $f(x,y)$: it could be Alice, Bob, or even a third party observing all the communication.
The main observation which allows us to use the theory of communication complexity is: \emph{One-way communication lower bounds imply streaming space lower bounds}.
\section{Different Types of Communication Complexity}
\subsection{Deterministic Communication Complexity}
We denote by $D(f)$ the minimum number of bits required by a deterministic communication protocol that always correctly computes $f$. That is, Alice and Bob behave deterministically.
\subsection{Randomized Communication Complexity}
$R^{\text{pri}}_{\delta}(f)$ is the minimum number of bits required to obtain $f$ with success probability at least $1 - \delta$ where Alice and Bob both have access to private random coins.
$R^{\text{pub}}_{\delta}(f)$ is the same as $R^{\text{pri}}_{\delta}(f)$ except that the random coins are public. (Imagine there is an infinite random string written in the sky, which both Alice and Bob can see.)
\subsection{Distributional Communication Complexity}
$D_{\mu, \delta}(f)$ is the \emph{distributional communication complexity}, the number of bits required to compute $f$ when the inputs are drawn from a fixed distribution of inputs $\mu$ and the success probability required is $1 - \delta$.
It is easy to see that $D(f) \geq R^{\text{pri}}_{\delta}(f) \geq R^{\text{pub}}_{\delta}(f) \geq D_{\mu, \delta}(f)$. (The last inequality is shown by invoking Yao's principle.) For randomized streaming algorithms the most relevant model is usually $R^{\text{pri}}_{\delta}(f)$. But for many examples lower-bounding $R^{\text{pub}}_{\delta}(f)$ is easier.
The book \emph{Communication Complexity} by Kushilevitz and Nisan \cite{KNisan} is an excellent reference for Communication Complexity.
\section{Exact deterministic $F_0$ requires $\Omega(n)$ bits}
To prove the lower bound, we reduce a hard communication problem to the streaming problem we want to solve.
Consider the function $EQ : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ defined as:
$$
EQ(x,y) =
\left\{
\begin{array}{ll}
1 & \mbox{if } x = y \\
0 & \mbox{otherwise}\\
\end{array}
\right.
$$
\begin{claim} $D(EQ) = \Omega(n)$
\end{claim}
\begin{proof} Suppose fewer than $n$ bits are communicated in total. Then there are fewer than $2^n$ possible transcripts, so by the pigeonhole principle two distinct inputs $x \neq x'$ yield identical transcripts on the pairs $(x,x)$ and $(x',x')$. A standard combinatorial-rectangle argument shows that the transcript on $(x,x')$ is then the same as well, so the protocol gives the same answer on $(x,x)$ and $(x,x')$ and must err on one of them.
\end{proof}
The reduction goes as follows. Suppose we are given a streaming algorithm $\mathcal{A}$ which computes $F_0$ exactly using $o(n)$ space. Given her input $x \in \{0,1\}^n$, Alice constructs the stream $s_x = \{i : x_i = 1\}$, runs $\mathcal{A}$ on $s_x$, and sends Bob the resulting memory content $m_0$ of $\mathcal{A}$. Using $m_0$, Bob can recover the entire string $x$: for each $i$, he initializes $\mathcal{A}$ with memory content $m_0$ and feeds it the single element $i$; if the value of $F_0$ increases then $x_i = 0$, and otherwise $x_i = 1$. Hence, as Bob knows $x$ as well as his own input $y$, he can easily compute $EQ(x,y)$. Since $\mathcal{A}$ uses $o(n)$ space, this yields a protocol for $EQ$ with $o(n)$ communication, contradicting the claim.
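The mechanics of this reduction can be sketched in Python. As a stand-in for the hypothetical $o(n)$-space algorithm $\mathcal{A}$ we use a trivial exact-$F_0$ ``sketch'' that simply stores the set of distinct elements seen; the class and function names here are illustrative, not part of the lecture.

```python
# Toy simulation of the reduction from EQ to exact streaming F0.
# ExactF0 stands in for an arbitrary exact F0 streaming algorithm:
# the reduction only uses feed() and query() plus the memory content.

class ExactF0:
    def __init__(self, memory=()):
        self.seen = set(memory)          # the "memory content"

    def feed(self, item):
        self.seen.add(item)

    def query(self):                     # current value of F0
        return len(self.seen)

def alice_message(x):
    """Alice streams s_x = {i : x_i = 1} and sends the memory content m_0."""
    sketch = ExactF0()
    for i, bit in enumerate(x):
        if bit == 1:
            sketch.feed(i)
    return frozenset(sketch.seen)

def bob_decides_eq(memory, y):
    """Bob recovers x bit by bit from Alice's memory, then tests x = y."""
    x = []
    for i in range(len(y)):
        sketch = ExactF0(memory)         # restart from m_0 for each i
        before = sketch.query()
        sketch.feed(i)                   # feed the single element i
        x.append(0 if sketch.query() > before else 1)
    return int(x == list(y))
```

For example, `bob_decides_eq(alice_message(x), y)` returns $1$ exactly when $x = y$, mirroring the protocol in the text.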
\section{The Indexing Problem}
The communication problem of Indexing is as follows:
\begin{itemize}
\item Alice gets $x \in \{0,1\}^n$.
\item Bob gets $j \in [n]$.
\item $f(x,j) = x_j$.
\end{itemize}
\begin{claim} (Indexing lower bound) $R^{\text{pub}}_{\delta}(f) \geq (1 - H_2(\delta))n$, where $H_2(\delta) = -\delta \log_2(\delta) - (1 - \delta)\log_2(1 - \delta)$ is the binary entropy function.
\end{claim}
Before proving this claim we introduce some basic information theory definitions and properties in the next few sections.
\section{Information Theory}
Let $X$ take values in some domain $\mathcal{X}$.
\begin{definition} $H(X) = -\sum_{x \in \mathcal{X}} p_x\log_2 (p_x)$, where $p_x = \Pr[X = x]$.
\end{definition}
\begin{definition} $H(X, Y) = - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{x,y}\log_2 (p_{x,y})$, where $Y$ takes values in $\mathcal{Y}$ and $p_{x,y} = \Pr[X = x \text{ and } Y = y]$.
\end{definition}
\begin{definition} (conditional entropy) $H(X | Y) = \E_{y \sim Y} [H(X | Y = y)]$.
\end{definition}
\begin{definition} (mutual information) $I(X; Y) = H(X) - H(X | Y)$
\end{definition}
Intuitively, the conditional entropy $H(X | Y)$ measures the amount of randomness left in $X$ once $Y$ is known, and the mutual information $I(X; Y)$ measures how much knowing $Y$ reduces the uncertainty in $X$. It can be proved (using simple manipulation) that $I(X; Y) = H(Y) - H(Y | X)$.
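These definitions are easy to check numerically. The following sketch computes $H(X)$, $H(X,Y)$, $H(X|Y)$, and $I(X;Y)$ for a small joint distribution of our own choosing (a uniform bit sent through a binary symmetric channel with flip probability $0.1$), and verifies the symmetry $I(X;Y) = H(Y) - H(Y|X)$:

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal distribution of the coordinate `axis` of a joint distribution."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

# Joint distribution of (X, Y): X is a uniform bit and Y equals X with
# probability 0.9 (a toy example, not from the lecture).
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

H_XY = entropy(joint)                   # H(X, Y)
H_X = entropy(marginal(joint, 0))       # H(X)
H_Y = entropy(marginal(joint, 1))       # H(Y)
H_X_given_Y = H_XY - H_Y                # chain rule: H(X | Y) = H(X, Y) - H(Y)
I_XY = H_X - H_X_given_Y                # mutual information I(X; Y)

# Symmetry: I(X; Y) = H(Y) - H(Y | X) as well.
assert abs(I_XY - (H_Y - (H_XY - H_X))) < 1e-9
```

Here $I(X;Y) \approx 1 - H_2(0.1) \approx 0.53$ bits: observing the noisy copy $Y$ removes about half the uncertainty in $X$.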
\section{Basic Properties of Entropy Functions}
\begin{enumerate}
\item \emph{Chain rule} $H(X, Y) = H(X) + H(Y | X)$
\item \emph{Subadditivity} $H(X, Y) \leq H(X) + H(Y)$ and equality holds iff $X$ and $Y$ are independent.
\item $H(X | Y) \leq H(X)$
\item $H(X) \leq \log_2(|\mathcal{X}|)$, with equality iff $X$ is uniform.
\item \emph{Fano's Inequality} Suppose there is a deterministic function $g$ such that $\Pr[g(Y) \neq X] \leq \delta$. Then $H(X | Y) \leq H(X | g(Y)) \leq H_2(\delta) + \delta\log_2(|\mathcal{X}| - 1)$.
\end{enumerate}
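As a sanity check of Fano's inequality, the following sketch evaluates both sides on a small example of our own (a uniform three-symbol source sent through a symmetric noisy channel, with the predictor $g(y) = y$). For this channel the errors are spread uniformly over the wrong symbols, and the inequality in fact holds with equality:

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# X uniform on {0, 1, 2}; the channel outputs Y = X with probability 0.8
# and each of the two wrong symbols with probability 0.1, so the predictor
# g(y) = y errs with probability delta = 0.2.
symbols = [0, 1, 2]
joint = {(x, y): (1 / 3) * (0.8 if x == y else 0.1)
         for x in symbols for y in symbols}

H_XY = entropy(joint)
H_Y = entropy({y: sum(joint[(x, y)] for x in symbols) for y in symbols})
H_X_given_Y = H_XY - H_Y                                  # H(X | Y)

delta = sum(p for (x, y), p in joint.items() if x != y)   # Pr[g(Y) != X]
H2 = -delta * math.log2(delta) - (1 - delta) * math.log2(1 - delta)
fano_bound = H2 + delta * math.log2(len(symbols) - 1)     # Fano's bound

assert H_X_given_Y <= fano_bound + 1e-9                   # Fano's inequality
```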
\section{Back to Indexing Lower Bound}
Suppose Bob receives the transcript $\Pi$ of the communication together with his input $J \in [n]$. We lower-bound the distributional complexity, choosing as the hard distribution the uniform one: $X$ uniform on $\{0,1\}^n$ and $J$ uniform on $[n]$, independently. Suppose Bob is able to predict $X_J$ with probability $\geq 1 - \delta$.
Using Fano's inequality, we get,
\begin{align*}
H_2(\delta) &\geq H(X_J | \Pi, J) \text{ (Fano's inequality, with $|\mathcal{X}| = 2$) } \\
&= \sum_{j = 1}^n \Pr[J = j]\cdot H(X_j | \Pi, J = j) \\
&= \frac{1}{n} \sum_{j = 1}^n H(X_j | \Pi) \\
&\geq \frac{1}{n} \sum_{j = 1}^n H(X_j | \Pi, X_1, \cdots, X_{j - 1}) \text{ (conditioning on more variables can only reduce entropy) } \\
&= \frac{1}{n} \sum_{j = 1}^n \left( H(X_1, \cdots, X_j, \Pi) - H(X_1, \cdots, X_{j - 1}, \Pi) \right) \text{ (chain rule) } \\
&= \frac{1}{n} \left( H(X_1, \cdots, X_n, \Pi) - H(\Pi) \right) \text{ (telescoping) } \\
&\geq \frac{1}{n} \left( H(X_1, \cdots, X_n) - H(\Pi) \right) \\
&= 1 - \frac{1}{n}H(\Pi) \text{ (since $X$ is uniform on $\{0,1\}^n$) } \\
&\geq 1 - \frac{1}{n}|\Pi| \text{ (since $H(\Pi) \leq |\Pi|$) }
\end{align*}
Rearranging, $|\Pi| \geq (1 - H_2(\delta))n$, which proves the claim.
\section{Probabilistic Exact Median Lower Bound}
\begin{claim}
The exact median problem with success probability $\geq 1 - \delta$ requires at least $(1 - H_2(\delta))n$ bits of space, if all integers are in $\{0, 1, \ldots, 2n + 2\}$ and the stream length is $2n - 1$.
\end{claim}
\begin{proof} The proof is via reduction from the Indexing problem.
Alice constructs the virtual stream $s_A = [2 + x_1, 4 + x_2, \cdots, 2i + x_i, \cdots, 2n + x_n]$, and Bob constructs the virtual stream $s_B = s_1 s_2$ where $s_1 = [0, 0, \cdots, 0]$ ($n - j$ copies) and $s_2 = [2n + 2, 2n + 2, \cdots, 2n + 2]$ ($j - 1$ copies).
Observe that the $n - j$ zeros in $s_B$ are smaller than every element of $s_A$, and the $j - 1$ copies of $2n + 2$ are larger; hence the median (the $n$-th smallest of the $2n - 1$ elements) of the concatenation of $s_A$ and $s_B$ is Alice's $j$-th element, $2j + x_j$. If Alice sends Bob the memory content of the algorithm after processing $s_A$, then Bob can finish running the algorithm on $s_B$ and thereby learn the value of $x_j$. (Recall that he already knows $j$.)
\end{proof}
\section{The $F_0$ Lower Bound}
We know that computing $F_0$ with relative error $\varepsilon$ requires $\Omega(\varepsilon^{-2} + \log n)$ bits of space. The $\log n$ term comes from \cite{AlonMS99}. Here we will show the $\Omega(1/\eps^2)$ bound. In our proof, we will reduce from the Gap Hamming problem.
\textbf{The Gap Hamming Problem}: Alice is given $x \in \{0,1\}^N$; Bob is given $y \in \{0, 1\}^N$. The goal is to compute $f = \Delta(x,y)$, the Hamming distance between $x$ and $y$, i.e., the number of bit positions where $x$ and $y$ differ. The error allowed is up to $\pm \sqrt{N}$.
\begin{theorem} $R_{1/3}^{\text{pub}}$ (Gap Hamming) $ = \Omega(N)$. \end{theorem}
The proof for the above theorem with one-way communication by a reduction from indexing can be found in \cite{JKS08}. The first proof of the optimal one way communication lower bound for Gap Hamming was in \cite{W04}.
\begin{claim}Computing $F_0$ with error $\varepsilon$ requires $\Omega(\varepsilon^{-2})$ bits. (As long as $\varepsilon^{-2} \leq n$)
\end{claim}
\begin{proof} We reduce from Gap Hamming. Set $N = c\varepsilon^{-2}$ for a suitable constant $c$. Alice and Bob get $x, y \in \{0,1\}^N$ respectively and interpret them as the streams $s_x = \{i : x_i = 1\}$ and $s_y = \{i : y_i = 1\}$. For the concatenated stream, $2F_0 = w(x) + w(y) + \Delta(x,y)$, where $w(x)$ denotes the number of $1$'s in $x$.
Alice sends Bob $w(x)$ along with the memory content of the algorithm after processing $s_x$. Bob continues the stream with $s_y$ and estimates $\Delta(x,y) = 2F_0 - w(x) - w(y)$. A $(1 \pm \varepsilon)$-approximation of $F_0$ yields an additive error $O(\varepsilon N) = O(\sqrt{N})$ in $\Delta(x,y)$, as required.
\end{proof}
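The identity $2F_0 = w(x) + w(y) + \Delta(x,y)$ underlying the reduction is easy to verify numerically (the helper names below are ours):

```python
import random

def f0_union(x, y):
    """F0 of the concatenated stream s_x . s_y, i.e., the number of
    distinct indices i with x_i = 1 or y_i = 1."""
    return len({i for i, b in enumerate(x) if b} |
                {i for i, b in enumerate(y) if b})

def weight(x):                   # w(x): number of 1's in x
    return sum(x)

def hamming(x, y):               # Delta(x, y): positions where x, y differ
    return sum(a != b for a, b in zip(x, y))

# Check 2*F0 = w(x) + w(y) + Delta(x, y) on random bit strings.
random.seed(0)
for _ in range(100):
    N = 25
    x = [random.randint(0, 1) for _ in range(N)]
    y = [random.randint(0, 1) for _ in range(N)]
    assert 2 * f0_union(x, y) == weight(x) + weight(y) + hamming(x, y)
```

The identity is just inclusion-exclusion: $|A \cup B| = |A| + |B| - |A \cap B|$ combined with $\Delta(x,y) = |A| + |B| - 2|A \cap B|$ for the supports $A, B$ of $x, y$.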
\section{Disjointness Problem and the $F_p$ Lower Bound}
We now introduce the communication problem ($t$-player) Disjointness which is useful for proving the space lower bounds for $F_p$.
First, we introduce a generic $t$-player communication game. There are $t$ players, denoted $A_1, A_2, \cdots, A_t$. Each $A_i$ is given an input $x_i$. $A_1$ sends a message to $A_2$, then $A_2$ sends one to $A_3$, and so on. One round consists of $t - 1$ messages, after which $A_t$ must report the value of $f(x_1,x_2,\cdots,x_t)$.
We obtain space lower bounds for a $(t-1)$-pass streaming algorithm using a $t$-player game. In that case, space required is at least (communication lower bound)$/(t - 1)$.
\textbf{Disjointness Problem ($t$-player)} Let $x_1, \cdots, x_t \in \{0,1\}^n$ be the indicator vectors of sets $A_1, \cdots, A_t \subseteq [n]$, and define $f$ as:
$$
f(x_1,x_2,\cdots, x_t) =
\left\{
\begin{array}{ll}
1 & \mbox{if } \forall i \neq j,\ A_i \cap A_j = \emptyset \\
0 & \mbox{if } \forall i \neq j,\ A_i \cap A_j = \{k\} \text{ for some fixed $k \in [n]$ }\\
\text{undefined} & \text{otherwise} \\
\end{array}
\right.
$$
We are given the guarantee that $f$ is not undefined, hence it evaluates to 0 or 1. The problem is to decide whether $f$ is 0 or 1.
We now state a theorem which would require about 3 lectures to prove, and is hence left out of the course. The proof uses an information theoretic approach known as {\em information complexity} \cite{CSWY01}. The idea is the following chain of inequalities, where $\Pi$ is the optimal $\delta$-error communication protocol for some function $f$: $R^{pub}_\delta(f) = |\Pi| \ge H(\Pi(\mathbf{X})) \ge I(\mathbf{X};\Pi(\mathbf{X}))$, where $\mathbf{X}$ is the set of inputs given to the $t$ players, and $\Pi(\mathbf{X})$ is the transcript of the communication protocol (or the ``communication log'') when the input is $\mathbf{X}$ (note that it is a random variable since $\Pi$ uses randomness). Then we define the {\em information complexity} $IC_{\mu,\delta}(f)$ as the minimum value of $I(\mathbf{X};\Pi(\mathbf{X}))$ achievable by any $\delta$-error protocol $\Pi$ when $\mathbf{X}$ is drawn from distribution $\mu$. Then we have that $R^{pub}_{\delta}(f) \ge IC_{\mu,\delta}(f)$ for all $\mu$. A variant of this approach was used by \cite{BJKS04} to obtain lower bounds for $t$-player disjointness, with improvements in \cite{CKS03}. The sharp bound was shown in \cite{G09}, with a later work showing how the arguments in \cite{BJKS04} could be strengthened to also get the sharp bound \cite{J09}.
\begin{theorem}
$R_{1/3}^{\text{pub}} $($t$-player Disjointness) = $\Omega(n/t)$
\end{theorem}
\begin{claim} 2-approximation of $F_p$ requires $\Omega(n^{1 - \frac{2}{p}})$ bits of space. \end{claim}
\begin{proof} We reduce from $t$-player Disjointness, where $t = \lceil (2n)^{\frac{1}{p}} \rceil$. We do the usual thing where the $i$th player creates a virtual stream that contains $j$ iff $j\in A_i$, and the streams are concatenated. If all the $A_i$ are disjoint, then every frequency is at most $1$, so $F_p \le n$. If, however, they all intersect at a single point, then that point has frequency $t$, so $F_p \geq t^p \geq 2n$. Thus a $2$-approximation of $F_p$ can be used to solve the Disjointness instance. For a space-$s$ streaming algorithm the total communication is $s(t-1)$ bits, so the space lower bound we obtain is $\Omega\left(\frac{n}{t(t - 1)}\right) = \Omega(n/t^2)= \Omega(n^{1 - \frac{2}{p}})$.
\end{proof}
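The gap exploited by this reduction can be checked concretely. The following sketch (with $n$ and $p$ chosen arbitrarily for illustration) verifies that the two cases of Disjointness are separated by a factor of $2$ in $F_p$:

```python
import math

def Fp(stream, p):
    """p-th frequency moment: sum over distinct items of frequency^p."""
    freq = {}
    for item in stream:
        freq[item] = freq.get(item, 0) + 1
    return sum(c ** p for c in freq.values())

n, p = 1000, 3                        # illustrative choices with p > 2
t = math.ceil((2 * n) ** (1 / p))     # number of players

# Disjoint case: every item appears at most once, so F_p <= n even in
# the worst case where the t sets cover all of [n].
disjoint_stream = list(range(n))
assert Fp(disjoint_stream, p) <= n

# Intersecting case: the common element k appears t times, so
# F_p >= t^p >= 2n, a factor-2 gap from the disjoint case.
k = 0
intersecting_stream = [k] * t
assert Fp(intersecting_stream, p) >= 2 * n
```

Any additional elements the players hold only increase $F_p$ in the intersecting case, so the gap survives.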
\bibliographystyle{alpha}
\begin{thebibliography}{42}
\bibitem{AlonMS99}
Noga~Alon, Yossi~Matias, Mario~Szegedy.
\newblock The Space Complexity of Approximating the Frequency Moments.
\newblock {\em J. Comput. Syst. Sci.}, 58(1):137--147, 1999.
\bibitem{BJKS04}
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar.
\newblock An information statistics approach to data stream and communication complexity.
\newblock {\em J. Comput. Syst. Sci.}, 68(4): 702--732, 2004.
\bibitem{CKS03}
Amit Chakrabarti, Subhash Khot, Xiaodong Sun.
\newblock Near-Optimal Lower Bounds on the Multi-Party Communication Complexity of Set Disjointness.
\newblock {\em IEEE Conference on Computational Complexity}, pgs.\ 107--117, 2003.
\bibitem{CSWY01}
Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, Andrew Chi-Chih Yao.
\newblock Informational Complexity and the Direct Sum Problem for Simultaneous Message Complexity.
\newblock {\em FOCS}, pgs.\ 270--278, 2001.
\bibitem{G09}
Andre Gronemeier.
\newblock Asymptotically Optimal Lower Bounds on the NIH-Multi-Party Information Complexity of the AND-Function and Disjointness.
\newblock {\em STACS}, pgs.\ 505--516, 2009.
\bibitem{J09}
T. S. Jayram.
\newblock Hellinger Strikes Back: A Note on the Multi-party Information Complexity of AND.
\newblock {\em APPROX-RANDOM}, pgs.\ 562--573, 2009.
\bibitem{JKS08}
T. S. Jayram, Ravi Kumar, D. Sivakumar.
\newblock The One-Way Communication Complexity of Hamming Distance.
\newblock {\em Theory of Computing}, 4(1): 129--135, 2008.
\bibitem{KNisan}
Eyal~Kushilevitz, Noam~Nisan.
\newblock {\em Communication Complexity}.
\newblock Cambridge University Press, 1997.
\bibitem{W04}
David P. Woodruff.
\newblock Optimal space lower bounds for all frequency moments.
\newblock {\em SODA}, pgs.\ 167--175, 2004.
\end{thebibliography}
\end{document}