\documentclass[10pt]{article}
\usepackage{amsmath,amsthm,amssymb,amsfonts,
hyperref, color, graphicx,ulem}
\usepackage[margin=1.25in]{geometry}
%\usepackage{epsfig}
\title{CS229 Lecture 16 Notes}
\begin{document}
\input{preamble.tex}
\lecture{16}{\today}{Madhu Sudan}{Jack Murtagh}
%%%% body goes in here %%%%
\section{Streaming Lower Bounds - Introduction to Model}
Streaming algorithms are typically used in settings where we want to compute or approximate a function on data that is too large to fit in one computer's memory. We should think of the data as so large that the algorithm can only make one pass over the input, i.e. it is ``streaming'' in, one element at a time. The goal of streaming algorithms is to minimize the space used (as a function of the input size) while approximating the target function as well as possible. One example application of streaming algorithms is a router, where we have a large amount of data flowing through a small device and we may want to track some statistics about the data. We will survey some fundamental results in this area and discuss lower bounds that can be proved with techniques from communication complexity.
To make this more precise, let $x_{1},x_{2},\ldots,x_{m}$ be the input, which we will imagine as very large and streaming into our algorithm one element at a time. Our goal is to compute or approximate a function $f(x_{1},x_{2},\ldots, x_{m})$ while only using $O(\log^{c}m)$ space for some constant $c$. One of the great surprises in streaming algorithms is that there is actually a relatively large collection of functions that can be approximated effectively in this model.
\section{Algorithms for Frequency Estimation}
One of the seminal papers in the theory of streaming algorithms is due to Alon, Matias and Szegedy \cite{AMS99} on the frequency estimation problem. Given input $x_{1},x_{2},\ldots,x_{m}$, where $x_{i}\in[n]$ for all $i$, define the frequency vector $(f_{1},f_{2},\ldots, f_{n})$ where $f_{i}$ is the number of occurrences of $i$ in the data stream. In other words
\[
f_{i}\triangleq\left|\{j\mid x_{j}=i\}\right|
\]
The goal is to compute the $k$th frequency moment $F_{k}$ of the stream, defined as
\[
F_{k}(x_{1},x_{2},\ldots,x_{m})\triangleq\sum_{i=1}^{n}f_{i}^{k}
\]
Note that
\begin{itemize}
\item $F_{1}$ is always equal to $m$, and this is trivially computable in logspace by maintaining a counter.
\item $F_{0}=\left|\{i\mid \exists j \text{~s.t.~} x_{j}=i\}\right|=$ the number of distinct elements in the stream. For this interpretation to really make sense, we should think of $F_{0}=\lim_{k\to 0}F_{k}$. There is a known randomized algorithm for approximating $F_{0}$ \cite{FM85}.
\end{itemize}
Alon, Matias, and Szegedy gave an elegant randomized algorithm for approximating $F_{2}$, which is closely related to the variance of the data stream. For all the algorithms presented here, we will solve a weaker version of the problems by assuming that the streaming algorithms have access to some very powerful hash function that is more or less random. Everything shown here can be made rigorous, though, by using other hash function constructions (e.g.\ limited-independence families). This is important because we do not want to use a hash function that requires us to store a huge table; the whole point of these algorithms is to conserve space.
The streaming algorithm works as follows. Associate with each $i\in[n]$ a random sign $R_{i}\in\{-1,+1\}$. Instead of directly computing $\sum_{i=1}^{n}f_{i}^{2}$, we will compute
\[
A=\sum_{j=1}^{m}R_{x_{j}}
\]
This is just a linear function and can clearly be computed efficiently, assuming that we have oracle access to the sequence of random bits. Then we just output $A^{2}$. The point is that
\begin{align*}
A^{2}&=\left(\sum_{j=1}^{m}R_{x_{j}}\right)^{2}=\left(\sum_{i=1}^{n}R_{i}f_{i}\right)^{2}\\
&=\sum_{1\leq i,k\leq n}R_{i}R_{k}f_{i}f_{k}\\
&=\sum_{i=1}^{n}f_{i}^{2}+\sum_{i\not=k}R_{i}R_{k}f_{i}f_{k}
\end{align*}
The first term of the last line is exactly what we wish to compute, and we can view the second term as noise in our approximation. The noise term has expectation zero (the $R_{i}$ are independent random signs), and by bounding its variance and applying Chebyshev's inequality (averaging a few independent copies of the estimator), one can show that $A^{2}$ is a good approximation of $F_{2}$ with high probability. In \cite{AMS99} they also show lower bounds for frequency estimation. For example, they prove that $F_{p}$ estimation for $p>2$ requires $\Omega(n^{1-2/p})$ space, showing that estimating higher moments incurs higher space complexity. Note that $F_{p}$ estimation is trivial using $O(n)$ space because you can just store the entire frequency vector.
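As a concrete illustration, here is a minimal Python sketch of the estimator above, with the idealized random signs drawn from a seeded pseudorandom generator and the estimate averaged over independent copies, as the variance analysis suggests. The function name and parameters are illustrative, not from \cite{AMS99}.

```python
import random

def ams_f2_estimate(stream, n, trials=200, seed=0):
    """Estimate F_2 = sum_i f_i^2 by averaging independent copies of A^2,
    where A = sum_j R_{x_j} for random signs R_1, ..., R_n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One fresh random sign per universe element (idealized hash).
        R = [rng.choice((-1, 1)) for _ in range(n)]
        A = sum(R[x] for x in stream)  # a single linear pass over the stream
        total += A * A
    return total / trials

stream = [0, 1, 1, 2, 2, 2]  # frequencies (1, 2, 3), so F_2 = 1 + 4 + 9 = 14
print(ams_f2_estimate(stream, n=3))  # should land near 14
```

Averaging more independent copies tightens the Chebyshev bound: the standard deviation of the mean shrinks at rate $1/\sqrt{\text{trials}}$.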
In 1985, Flajolet and Martin gave a beautiful algorithm for $F_{0}$ estimation \cite{FM85}. Begin by picking a random hash function
\[h\colon[n]\to[0,1].\]
In the sequence $h(x_{1}),h(x_{2}),\ldots,h(x_{m})$, let
\[
h_{\min}\triangleq\min_{j\in[m]}h(x_{j})
\]
and output $\frac{1}{h_{\min}}$. This is easy to compute: as the stream comes in we compute the hash of each element, and we only need to maintain the minimum. If the hash value of the incoming element is smaller than our current stored minimum, we replace it with the new value. This works because if we have, say, elements from the set $\{1,2,\ldots,10\}$, then the expected minimum hash value in $[0,1]$ is around $1/10$. Outputting the reciprocal then gives us an estimate of the number of distinct elements in the stream. The original versions of this algorithm got results of the form:
\[
\Pr\left[\text{output~}\not\in [\frac{1}{c}\cdot F_{0},c\cdot F_{0}]\right]\leq \frac{1}{c}
\]
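The minimum-tracking idea can be sketched in a few lines of Python. The idealized hash $h\colon[n]\to[0,1]$ is simulated here by seeding a pseudorandom generator with the element itself, so repeated elements hash identically (the function name is illustrative).

```python
import random

def fm_f0_estimate(stream):
    """Flajolet--Martin style sketch: track the minimum (idealized) hash
    value in [0,1] and output its reciprocal as an estimate of F_0."""
    h_min = 1.0
    for x in stream:
        h = random.Random(x).random()  # idealized hash: same element -> same value
        h_min = min(h_min, h)          # only the minimum is stored
    return 1.0 / h_min
```

Note that duplicates never change the sketch, which is exactly why the output depends only on the set of distinct elements.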
We can actually do better if we don't just track the minimum hash value but rather the $t$ smallest hash values of the stream \cite{bjkst}. Then if $h_{t}$ is the $t$th smallest hash value, output $\frac{t}{h_{t}}$. This estimator is more tightly concentrated around the target value than the first approach of just outputting the reciprocal of the minimum. It turns out that this algorithm gives a $(1\pm\Theta(1/\sqrt{t}))$ approximation to $F_{0}$. Setting $t=1/\epsilon^{2}$ then gives us a $(1\pm\epsilon)$ approximation to $F_{0}$, and the algorithm uses space $O\left(\frac{1}{\epsilon^{2}}\cdot\text{polylog}(m)\right)$. The latest results are able to shave the polylog$(m)$ factor in the space complexity to polyloglog$(m)$, but unfortunately we will show in the next section that the $1/\epsilon^{2}$ factor for a $(1\pm\epsilon)$ approximation is optimal.
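A sketch of the $t$-smallest-values variant follows, again with an idealized hash. For clarity this version stores all hash values and then takes the $t$th smallest; a space-efficient implementation would keep only the $t$ smallest values seen so far (e.g.\ in a heap). Names are illustrative, not the exact \cite{bjkst} algorithm.

```python
import random

def kmv_f0_estimate(stream, t=64):
    """Estimate F_0 via the t smallest hash values: output t / h_t."""
    # Idealized hash into [0,1]; duplicate elements collapse to one value.
    hashes = {random.Random(x).random() for x in stream}
    if len(hashes) <= t:
        return len(hashes)          # fewer than t distinct elements: exact count
    h_t = sorted(hashes)[t - 1]     # t-th smallest hash value
    return t / h_t                  # concentrated within (1 +/- O(1/sqrt(t))) F_0
```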
\section{\texorpdfstring{$F_{0}$}{F0}-Estimation Lower Bounds via Communication Complexity}
In general, lower bounds in communication complexity naturally translate to lower bounds for streaming algorithms. These are often proved via the contrapositive: upper bounds in streaming imply upper bounds in communication complexity. To draw the connection between the two, think of streaming as an $m$-player game where the player holding $x_{1}$ sends $s_{1}$ to the player holding $x_{2}$, who sends $s_{2}$ to the player holding $x_{3}$, etc. We want $s_{m}$ to be a good approximation of $f(x_{1},\ldots,x_{m})$, the function we wish to compute. Amazingly, we can simplify this to the case of just two-player communication by splitting the stream in half and still get strong results. Specifically, we will think of Alice receiving as input $x_{1},\ldots,x_{m/2}$ and Bob receiving $x_{m/2+1},\ldots, x_{m}$. Alice then computes some function of her inputs and sends a message to Bob, who must then output a value for the function. This further restriction is called one-way communication, and the idea is that a space-$S$ algorithm for the streaming problem implies an $S$-bit protocol for the analogous communication problem.
\subsection{Reduction to Gap Hamming}
We will prove the desired lower bound by a reduction to the Gap Hamming Problem \cite{IW}. In this problem, Alice receives $x\in\{-1,+1\}^{n}$ and Bob receives $y\in\{-1,+1\}^{n}$. The goal is to decide whether:
\begin{align*}
\langle x,y\rangle &\geq +g\\
\text{or} ~~\langle x,y\rangle &\leq -g,
\end{align*}
where $\langle x,y\rangle $ is the inner product between $x$ and $y$:
\[
\langle x,y\rangle \triangleq\sum_{i=1}^{n}x_{i}y_{i}
\]
Obviously the inner product lies between $-n$ and $n$, and roughly what this question is asking is how often the strings $x$ and $y$ agree: the larger the inner product, the more the two strings agree; the smaller it is, the more they disagree; and the closer it is to $0$, the less correlated the strings are. There is a fairly straightforward reduction from the disjointness problem that implies that the Gap Hamming Problem with $g=1$ requires $\Omega(n)$ communication.
For upper bounds, consider the extreme case where $g=\epsilon\cdot n$. A reasonable protocol is to sample a subset of the coordinates using correlated randomness and check the number of coordinates in the sample on which $x$ and $y$ agree (Alice just sends her $x$ values on these coordinates). If we pick a large enough sample, the fraction of agreements in the sample should closely approximate the fraction of agreements between the full strings. Choosing $1/\epsilon^{2}$ coordinates suffices for a good approximation, so the communication complexity is $O(1/\epsilon^{2})$ when $g=\epsilon\cdot n$. In other words, the communication complexity is $O((n/g)^{2})$. We can also always take the naive approach where Alice just sends $x$ to Bob, giving an upper bound of $\min\{O((n/g)^{2}),n\}$. This is an easy upper bound, and it turns out to be tight, which means that in the regime $g=\sqrt{n}$, we get a communication complexity lower bound of $\Omega(n)$ \cite{IW}. This linear lower bound was for the one-way version of the Gap Hamming Problem, which will suffice for our purposes, but later it was proven that in fact the general two-way version of the problem requires $\Omega(n)$ communication \cite{CR, sherstov}.
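The sampling protocol can be sketched as follows; the shared seed plays the role of the correlated randomness, and the names are illustrative. Bob decides the gap question by the sign of the scaled sample.

```python
import random

def sampled_inner_product(x, y, eps, seed=0):
    """One-way protocol sketch for Gap Hamming with gap g = eps * n:
    shared randomness selects ~1/eps^2 coordinates, Alice sends her
    bits on them, and Bob rescales the sampled inner product by n/k."""
    n = len(x)
    k = max(1, int(1 / eps ** 2))
    shared = random.Random(seed)                  # public coins known to both
    coords = [shared.randrange(n) for _ in range(k)]
    alice_msg = [x[i] for i in coords]            # the k bits Alice sends
    sample = sum(a * y[i] for a, i in zip(alice_msg, coords))
    return sample * n / k                         # estimate of <x, y>

x = [1] * 1000
y = [1] * 600 + [-1] * 400   # true inner product = 200
print(sampled_inner_product(x, y, eps=0.05))  # noisy estimate; its sign decides the gap
```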
Now we show that a Gap Hamming lower bound implies a lower bound on $F_{0}$ estimation. The idea is that Alice has $x\in\{-1,+1\}^{n}$ and Bob has $y\in\{-1,+1\}^{n}$. Alice computes $S=\{i\in[n]\mid x_{i}=+1\}$ and Bob computes $T=\{j\in[n]\mid y_{j}=+1\}$. We want to write $\langle x,y\rangle$ as a function of $|S|,|T|,$ and $|S\cup T|$, which will imply that if we can compute $F_{0}$ estimations, then we can approximate the sizes of these sets, which in turn allows us to estimate the inner product. Notice that:
\[
\langle x,y\rangle = |S\cap T| + (n-|S\cup T|) - |S\setminus T|-|T\setminus S|
\]
Substituting $|S\cap T|=|S|+|T|-|S\cup T|$, $|S\setminus T|=|S\cup T|-|T|$, and $|T\setminus S|=|S\cup T|-|S|$ gives
\[
\langle x,y\rangle = n+2|S|+2|T|-4|S\cup T|,
\]
so the only quantity not known exactly is $|S\cup T|$, and in fact our $F_{0}$ estimation will supply $|S\cup T|$. Now we just need to show how Alice and Bob can estimate $|S\cup T|$. The protocol starts with Alice sending $|S|$ to Bob, using only $O(\log n)$ bits. Then she runs the streaming algorithm on the elements of $S$ and sends its memory contents to Bob, who continues running it on the elements of $T$; the output estimates the number of distinct elements in the combined stream, i.e.\ $|S\cup T|$. So if we could get a $(1\pm\frac{1}{\sqrt{n}})$ approximation to $|S\cup T|$ in space $o(n)$, it would imply an $o(n)$ one-way communication protocol for the Gap Hamming Problem, which would contradict the known lower bound.
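The algebra behind the reduction is easy to verify mechanically. The short Python check below (helper name hypothetical) confirms that $n+2|S|+2|T|-4|S\cup T|$ equals the inner product on random inputs.

```python
import random

def inner_from_sets(n, S, T):
    """<x,y> computed only from n, |S|, |T|, and |S u T|."""
    return n + 2 * len(S) + 2 * len(T) - 4 * len(S | T)

rng = random.Random(0)
for _ in range(100):
    n = 50
    x = [rng.choice((-1, 1)) for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    S = {i for i in range(n) if x[i] == 1}
    T = {i for i in range(n) if y[i] == 1}
    assert inner_from_sets(n, S, T) == sum(a * b for a, b in zip(x, y))
```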
\subsection{Reduction to Indexing}
Finally, we will show the one-way communication lower bound on the Gap Hamming Problem. The idea is to reduce from the Indexing Problem. In this problem, Alice gets as input a string $z\in\{0,1\}^{n}$ and Bob gets an index $i\in [n]$. Their goal is to output $z_{i}$. This is clearly very easy to solve with two-way communication. Bob simply sends his coordinate to Alice using $\log n$ bits and she can output $z_{i}$. It turns out that this problem is hard for one-way communication.
Given an instance of Indexing, we want to show how to solve it using a protocol for Gap Hamming. Let $R$ be an $n\times n$ matrix of shared random signs from $\{-1,+1\}$. Alice will compute $\tilde{x}=Rz$, a column vector of length $n$, and then convert this to a vector $x\in\{-1,+1\}^{n}$ by taking the sign of each element of $\tilde{x}$ (if a coordinate is $0$, just call it $+1$). Bob does the same thing with the vector $e_{i}$, containing all zeroes except a one in the $i$th coordinate. He computes $y=Re_{i}$. The claim is that:
\begin{align*}
\text{If }z_{i}&=1 \text{~then~} \langle x,y\rangle \text{~is large w.h.p.} \\
\text{If }z_{i}&=0 \text{~then~} \langle x,y\rangle < 0 \text{~w.p.~} 1/2
\end{align*}
To see this, we will start with the second case. Notice that $y$ is just the $i$th column of $R$. So if $z_{i}=0$ then $x$ is completely independent of column $i$ of $R$. Thus $x$ and $y$ are independent, and by symmetry their inner product is negative with probability (roughly) $1/2$.
If $z_{i}=1$, then $\tilde{x}$ also picks up column $i$ in the multiplication, and we are asking what the probability is that the resulting vector $x$ is correlated with column $i$, i.e.\ with the vector $y$. Let $R^{(i)}$ be $R$ with the $i$th column removed and $z^{(i)}$ be $z$ with the $i$th coordinate removed. Now we can write $\tilde{x}$ in the form
\[
\tilde{x}=R^{(i)}\cdot z^{(i)}+y
\]
Again we convert $\tilde{x}$ to $x$ by taking the sign of every element. We want to know how correlated $x$ is with $y$. Another way of asking this: when we add the column $y$ above, what is the probability that this flips the sign of coordinate $j$ of $x$? Each entry of $\tilde{x}$ before adding $y$ is a sum of independent random signs, so its distribution looks like a Gaussian of standard deviation $\Theta(\sqrt{n})$. The probability that the $\pm 1$ contribution from $y$ flips the sign of coordinate $j$ is therefore $\Theta(1/\sqrt{n})$ for each $j$, and when a flip happens it always moves the sign toward agreement with $y_{j}$. Summing over the $n$ coordinates, we get $\langle x,y\rangle = \Theta(n/\sqrt{n})=\Theta(\sqrt{n})$ with high probability.
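The two cases of the claim can be checked numerically. The sketch below (hypothetical helper name) draws a shared random sign matrix and measures $\langle x,y\rangle$ in the $z_{i}=1$ case, where the inner product should concentrate around $\Theta(\sqrt{n})$ rather than $0$.

```python
import random

def correlation_with_column(n, z, i, seed):
    """Build a shared n x n random sign matrix R, set x = sign(R z) and
    y = R e_i (the i-th column of R), and return <x, y>."""
    rng = random.Random(seed)
    R = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    x_tilde = [sum(row[c] * z[c] for c in range(n)) for row in R]  # R z
    x = [1 if v >= 0 else -1 for v in x_tilde]  # zeros rounded to +1
    y = [row[i] for row in R]
    return sum(a * b for a, b in zip(x, y))

n = 200
rng = random.Random(1)
z = [rng.choice((0, 1)) for _ in range(n)]
z[7] = 1  # make sure the queried bit is 1
avg = sum(correlation_with_column(n, z, 7, s) for s in range(20)) / 20
print(avg)  # typically positive, on the order of sqrt(n)
```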
\subsection{Complexity of Indexing}
To complete the lower bound on $F_{0}$ estimation, we should establish the lower bound on the Indexing Problem. First consider what can be done with a deterministic protocol. Alice sends a message $m$ to Bob. Suppose $g$ is the optimal function for Bob to use on Alice's message. Bob can compute $g(m,1)=\tilde{z}_{1}, g(m,2)=\tilde{z}_{2}, \ldots, g(m,n)=\tilde{z}_{n}$. Clearly the only way to guarantee that Bob learns $z_{i}$ is if $\tilde{z}_{j}=z_{j}$ for all $j$. So there must be a one-to-one correspondence between possible $z$ vectors and possible messages $m$. Since there are $2^{n}$ possible vectors $z$, the message must be at least $n$ bits long, so deterministic one-way Indexing requires $\Omega(n)$ communication.