\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm,tikz}
\usepackage{url}
\usetikzlibrary{trees}
\DeclareMathOperator*{\E}{\mathbb{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbb{P}}
\newcommand{\eps}{\varepsilon}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf CS 229r: Algorithms for Big Data } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newcommand{\bs}[1]{\boldsymbol{#1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\begin{document}
\lecture{3 --- September 10, 2013}{Fall 2013}{Prof.\ Jelani Nelson}{Christopher Musco}
\section{Overview}
In the previous lecture we discussed algorithms (idealized and practical) for the Distinct Elements problem, or the ``$F_0$ problem''. A section was also added to the notes on $F_2$ estimation for a continually updated vector $\bs{x} \in \R^n$. Using the AMS sketch algorithm presented in \cite{AlonMS99}, we saw that, with just $O(\epsilon^{-2}\log(1/\delta)\log(n))$ bits of space, it is possible to maintain an $\epsilon$-good approximation to $\|\bs{x}\|_2^2$ with probability $> 1 - \delta$.
\\\\In this lecture we'll continue looking at sketching. We will:
\begin{enumerate}
\item Review the ``turnstile stream'' model.
\item Consider linear sketches for $F_p$ estimation when $0 < p \leq 2$.
\item Begin looking at the $p > 2$ case.
\end{enumerate}
\section{Turnstile Streams}
As described in the Lecture 2 notes, the general streaming model we will follow is:
\begin{itemize}
\item Begin with a length-$n$ vector $x$ initialized to $\bs{0}$.
\item Stream in a sequence of updates of the form $\{i,v\}$, where $i \in \{1,\ldots,n\}$ and $v \in \R$.\\
For each update, set $x_i \leftarrow x_i + v$. In practice, we would typically bound the magnitude of $v$.
\item After streaming, return $f(x)$ for some $f$. Depending on what $f$ is, we should be able to do this without actually maintaining $x$ (which could have many elements) explicitly.
\end{itemize}
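The turnstile model and the linearity that makes sketching possible can be illustrated with a minimal sketch in code (all names and sizes here are made up for illustration; a real $\Pi$ would be chosen for the particular $f$):

```python
import random

# Toy illustration of the turnstile model. We maintain x explicitly only
# to check the sketch; the point of sketching is to avoid storing x.
n, m = 8, 3
random.seed(0)

# A generic linear sketch: y = Pi @ x for a (here: random sign) matrix Pi.
Pi = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(m)]

x = [0] * n          # the underlying vector (for reference only)
y = [0] * m          # the sketch we actually maintain

def update(i, v):
    """Process a turnstile update {i, v}: x_i <- x_i + v."""
    x[i] += v
    for j in range(m):            # linearity: y <- y + v * Pi[:, i]
        y[j] += v * Pi[j][i]

for (i, v) in [(0, 5), (3, -2), (0, 1), (7, 4)]:
    update(i, v)

# The maintained sketch equals Pi @ x regardless of the update order.
assert all(y[j] == sum(Pi[j][i] * x[i] for i in range(n)) for j in range(m))
```

Because the sketch is linear, updates can arrive in any order and with any signs, which is exactly what the turnstile model demands.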
In linear sketching, we reduce our space complexity by maintaining an $m$-length vector $y = \Pi\bs{x}$, where $\Pi \in \R^{m\times n}$ and $m \ll n$. Our space complexity is then $O(m)$, plus whatever space is required to store $\Pi$, which had better be small as well.

For $0 < p \leq 2$, Indyk's algorithm \cite{Indyk06} takes the entries of $\Pi$ to be i.i.d.\ draws from a $p$-stable distribution $D_p$. Writing $Z_i$ for the $i$th row of $\Pi$, $p$-stability means that each $y_i = \inprod{Z_i, x}$ is distributed as $\|x\|_p$ times a draw from $D_p$, so (after scaling so that the median of $|D_p|$ is $1$) we estimate $\|x\|_p$ by $\mathrm{median}_i\, |y_i|$.
\begin{claim}
For $m = O(1/\epsilon^2)$, $\mathrm{median}_i\, |y_i| \in [(1-\epsilon)\|x\|_p, (1+\epsilon)\|x\|_p]$ with probability $> 2/3$.
\end{claim}
\begin{proof}
If fewer than half of the $m$ elements in $\{|y_1|,\ldots,|y_m|\}$ lie below $(1-\epsilon)\|x\|_p$, the median must be above this value; and if at least half lie below $(1+\epsilon)\|x\|_p$, the median must be below that value. By our choice of scaling, each event $|y_i|/\|x\|_p \leq 1-\epsilon$ (respectively $\leq 1+\epsilon$) has probability bounded below (respectively above) $1/2$ by $\Theta(\epsilon)$, so in expectation the two counts land on the correct sides of $m/2$ as desired.
\\\\Next we should look at the variance of these count estimates to make sure that the median actually falls in this range with high probability.
\begin{align}
\mathrm{Var}\left(\sum_{i=1}^m I_{[-1+\epsilon,1-\epsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right) = \sum_{i=1}^m \mathrm{Var}\left(I_{[-1+\epsilon,1-\epsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right) \leq \sum_{i=1}^m 1 = m
\end{align}
This follows from the independence of the $y_i$ (the rows of $\Pi$ are independent) and the fact that an indicator function can only take the values $0$ and $1$. Similarly,
\begin{align}
\mathrm{Var}\left(\sum_{i=1}^m I_{[-1-\epsilon,1+\epsilon]}\left(\frac{y_i}{\|x\|_p}\right) \right) \leq m
\end{align}
Now, we can apply Chebyshev's inequality to each of these counts to show that, with appropriately chosen $m$, with high probability (i.e.\ $>2/3$):
\begin{align*}
\left(1-\epsilon\right)\|x\|_p < \mathrm{median}_i\, |y_i| < \left(1+\epsilon\right)\|x\|_p
\end{align*}
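To make the Chebyshev step explicit (assuming, as in the standard analysis of Indyk's algorithm, that each count has expectation bounded away from $m/2$ by $c\epsilon m$ for some constant $c > 0$):
\begin{align*}
\Pr\left(\left|\sum_{i=1}^m I_{[-1+\epsilon,1-\epsilon]}\left(\frac{y_i}{\|x\|_p}\right) - \E\left[\sum_{i=1}^m I_{[-1+\epsilon,1-\epsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right]\right| \geq c\epsilon m\right) \leq \frac{m}{(c\epsilon m)^2} = \frac{1}{c^2\epsilon^2 m},
\end{align*}
and likewise for the $(1+\epsilon)$ count. Taking $m = \Theta(1/\epsilon^2)$ makes each failure probability at most $1/6$, so by a union bound both counts are on the correct sides of $m/2$, and hence the median is in the desired range, with probability $> 2/3$.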
\end{proof}
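As a concrete sanity check (not from the lecture; all parameters below are made up), take $p = 1$: the standard Cauchy distribution is $1$-stable, and the median of $|C|$ for Cauchy $C$ is exactly $1$, so no rescaling is needed:

```python
import math
import random

random.seed(42)

def cauchy():
    # Standard Cauchy via inverse-CDF sampling; Cauchy is 1-stable.
    return math.tan(math.pi * (random.random() - 0.5))

n, m = 50, 2000                       # toy dimension and number of sketch rows
x = [random.uniform(-1, 1) for _ in range(n)]
l1 = sum(abs(xi) for xi in x)         # the true value of ||x||_1

# Each row Z_i of Pi has i.i.d. Cauchy entries, so by 1-stability
# y_i = <Z_i, x> is distributed as ||x||_1 times a Cauchy draw.
y = [sum(cauchy() * xi for xi in x) for _ in range(m)]

# The median of |Cauchy| is exactly 1, so median_i |y_i| estimates ||x||_1.
est = sorted(abs(yi) for yi in y)[m // 2]
print(est / l1)   # typically close to 1
```

Note that a mean would not work here: the Cauchy distribution has no finite expectation, which is exactly why the estimator uses a median.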
So, this algorithm works. However, our $\Pi$ consisted of $mn$ random numbers drawn from $D_p$. That requires too much space to store. How do we ``derandomize" this algorithm?
\subsection{Pseudorandomness}
General approach: given a short truly random bit string, generate a longer ``pseudorandom'' string that looks random to some class of algorithms (i.e.\ the algorithms' behavior doesn't change by much when given the pseudorandom string in place of a truly random one).
\\\\Can we replace $\{Z_{i,1}, \ldots, Z_{i,n}\}$ with a pseudorandom string? We can by using Nisan's Pseudorandom Generator (PRG) for space-bounded algorithms, which is described in \cite{Nisan92}. \textit{Trivia}: Pseudorandom was actually misspelled as ``Psuedorandom" in the title of Nisan's original conference paper (corrected for the journal version)...
\subsubsection{Nisan's PRG}
Prof. Nelson had a nice diagram in class that unfortunately my \LaTeX\, skills aren't currently able to reproduce. I'll try to get something together and put up a change soon.
\\\\Anyway, imagine we have a program that runs using at most $s$ bits of space and $R$ random bits. Then, the program has at most $2^s$ possible memory states, $\{S_1, S_2, \ldots, S_{2^s}\}$. Let's call the state of the program just before reading in our first random bit, $r_1$, ``Layer 1'' or $L_1$; here $L_1 \in \{S_1, \ldots, S_{2^s}\}$. In general, $L_i$ will be the state of the program just before reading in random bit $r_i$. If we use at most $R$ random bits, there are $R$ such layers and also a final layer, $L_{R+1}$, which is the state of the program just after reading in the last random bit. At any layer $L_a$ in our execution, if we're at state $S_i$, then based on the next random bit, $L_{a+1}$ will equal one of two possible states, $S_j$ or $S_k$. Note that $j$ and $k$ don't have to be distinct (perhaps even after reading a random bit the execution is deterministic), and either can equal $i$ (our memory contents might not change at all). Based on this model, our $R$ random bits induce a distribution over the final layer, $L_{R+1} \in \{S_1, \ldots, S_{2^s}\}$, of our execution.
\begin{theorem}
\label{thm:nisan}
(\cite{Nisan92}) For any $R \le 2^{cs}$ for some constant $c>0$, there is a PRG $h:\{0,1\}^{O(s\log(R/s))} \to \{0,1\}^R$ such that the distribution on the final layer $L_{R+1}$ when using the pseudorandom bits is indistinguishable, up to $1/2^s$, from the distribution when using $R$ truly random bits. That is to say, for any function $f:\{0,1\}^R\rightarrow \{0,1\}$ computable in space $s$, if $x$ is uniform in $\{0,1\}^R$ and $y$ is uniform in $\{0,1\}^{O(s\log(R/s))}$, then
$$
\left|\Pr(f(x) = 1) - \Pr(f(h(y)) = 1)\right| \le \frac{1}{2^s} .
$$
\end{theorem}
What does such a generator look like? Say we start with a string $x\in\{0,1\}^s$ of truly random bits. Place $x$ at the root of a tree. In this tree, every left child of a node simply equals the bit string at that node. To generate right children, choose a random hash function $h_i:[2^s]\to[2^s]$ from a 2-wise independent family for each level of the tree; a node holding bit string $z$ has right child $h_i(z)$, where $i$ is the level of its children. So, the first few levels of our tree would look like:
\begin{center}
\begin{tikzpicture}[level distance=1.5cm,
level 1/.style={sibling distance=3cm},
level 2/.style={sibling distance=1.5cm}]
\node {$x$}
child {node {$x$}
child {node {$x$}}
child {node {$h_2(x)$}}
}
child {node {$h_1(x)$}
child {node {$h_1(x)$}}
child {node {$h_2(h_1(x))$}}
};
\end{tikzpicture}
\end{center}
At each leaf we have $s$ bits, so to get $R$ pseudorandom bits we need $R/s$ leaves $\rightarrow \log(R/s)$ levels $\rightarrow \log(R/s)$ hash functions. Recall from Lecture 2 that each hash function requires $O(s)$ random bits $\rightarrow O(s\log(R/s))$ truly random bits total.
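The tree expansion is easy to sketch in code (a toy illustration; the hash $h(z) = (az+b) \bmod 2^s$ below is a simplified stand-in for a true 2-wise independent family on $[2^s]$):

```python
import random

random.seed(1)
s = 16                  # seed length in bits; node labels live in [2^s]
k = 4                   # tree levels, giving 2^k leaves and R = s * 2^k bits
MOD = 1 << s

# One hash function per level; storing each h_i costs O(s) random bits.
hs = [(random.randrange(1, MOD), random.randrange(MOD)) for _ in range(k)]

def expand(z, level):
    """Return the leaves of the subtree rooted at node z."""
    if level == k:
        return [z]
    a, b = hs[level]
    # Left child keeps z unchanged; right child applies this level's hash.
    return expand(z, level + 1) + expand((a * z + b) % MOD, level + 1)

seed = random.randrange(MOD)
leaves = expand(seed, 0)              # R/s = 2^k blocks of s pseudorandom bits
assert len(leaves) == 2 ** k
assert leaves[0] == seed              # the leftmost leaf is the seed itself
```

The concatenated leaves form the $R$-bit pseudorandom output, generated from only $s + O(sk) = O(s\log(R/s))$ truly random bits.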
\subsubsection{Algorithm to Fool}
So how much pseudorandomness can we use in Indyk's algorithm? If we need $R$ random bits total, how many truly random bits do we need to feed into Nisan's PRG to get an acceptable string of $R$ pseudorandom bits out? Let's consider a new algorithm, ``ATF'', the Algorithm to Fool. This algorithm checks Indyk's algorithm to determine whether or not it was actually given truly random bits (as opposed to pseudorandom bits). If we can fool ATF with a certain level of pseudorandomness, then we're fine to pass our pseudorandom bits into Indyk's algorithm.
\\\\Since it's a checker algorithm, we can assume that we already know the true value of $\|x\|_p$.
\begin{enumerate}
\item Initialize three counters $curr = c_1 = c_2 = 0$.
\item Initialize two more counters $i\in[m]$ and $j\in[n]$ to 1.
\item Maintain $\inprod{Z_i, x}$ in $curr$, reading the entries of $\Pi$ row by row, left to right.
\item At the end of each row $i$: if $|curr| \leq (1-\epsilon)\|x\|_p$, set $c_1 \leftarrow c_1 + 1$;\\
if $|curr| \leq (1+\epsilon)\|x\|_p$, set $c_2 \leftarrow c_2 + 1$.
\end{enumerate}
So, after running ATF:
\begin{align}
c_1 &= \#\{i : |y_i| \leq (1-\epsilon)\|x\|_p\} \\
c_2 &= \#\{i : |y_i| \leq (1+\epsilon)\|x\|_p\}
\end{align}
Indyk's algorithm succeeds iff $c_1 < \frac{m}{2}$ and $c_2 \geq \frac{m}{2}$: fewer than half of the $|y_i|$ fall below $(1-\epsilon)\|x\|_p$, and at least half fall below $(1+\epsilon)\|x\|_p$, which pins the median to the desired range.
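ATF can be sketched in code as follows (illustrative only; we use random $\pm 1$ entries as a stand-in for the $p$-stable entries of $\Pi$, since the checker only needs to read the matrix in row-major order):

```python
import random

random.seed(0)
n, m, eps = 20, 9, 0.5
x = [random.uniform(-1, 1) for _ in range(n)]
norm = sum(abs(v) for v in x)        # assume ||x||_p is known (here p = 1)

# Stand-in for Pi: ATF only needs to read the entries row by row.
Pi = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(m)]

c1 = c2 = 0
for i in range(m):                   # row by row ...
    curr = 0.0
    for j in range(n):               # ... left to right, maintaining <Z_i, x>
        curr += Pi[i][j] * x[j]
    if abs(curr) <= (1 - eps) * norm:
        c1 += 1
    if abs(curr) <= (1 + eps) * norm:
        c2 += 1

# median_i |y_i| lies in [(1-eps)||x||_p, (1+eps)||x||_p] iff c1 < m/2 <= c2.
print(c1, c2)
```

The key point for derandomization is how little state this loop keeps: one running inner product and a few counters, not the whole matrix.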
\\\\Now, the space complexity of ATF is $s = O(\log n + \log(\text{precision with which we store } curr))$, and the total number of random bits it reads is $R = \mathrm{poly}(n)$.
Based on Theorem \ref{thm:nisan}, we need $O(s\log(R/s)) = O(\log n\cdot \log n) = O(\log^2 n)$ bits of true randomness to fool ATF, so Indyk's algorithm can be run from a seed of only $O(\log^2 n)$ truly random bits, regenerating the entries of $\Pi$ from the seed as needed.
\section{The $F_p$ Problem: $p > 2$}
This is an algorithm from \cite{Andoni12} based on the works \cite{JST11,AKO11}. The first nearly optimal space algorithm for $p>2$ is due to \cite{IW05}.
In this algorithm, we store a sketch $y = PDx$. Here $P$ is an $m\times n$ matrix: in each column of $P$ we place a single $\pm 1$ (with a random sign) in a uniformly random row. So, there are exactly $n$ non-zero entries, one per column. Define the $n\times n$ diagonal matrix $D$ as:
\begin{align}
D=
\begin{bmatrix}
1/u_1^{1/p} & \ldots & 0\\
\vdots & \ddots & \vdots \\
0 & \ldots & 1/u_n^{1/p}
\end{bmatrix}
\end{align}
where each $u_i \sim \mathrm{Exp}$. Recall that, if $X \sim \mathrm{Exp}$:
\begin{align}
\Pr(X > t) = \begin{cases}
1 & \text{if } t\leq 0 \\
e^{-t} & \text{if } t > 0
\end{cases}
\end{align}
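The reason exponential scaling helps is a ``max-stability'' property of the exponential distribution (this is the calculation behind the title of \cite{Andoni12}): for $t > 0$,
\begin{align*}
\Pr\left(\max_i \frac{|x_i|^p}{u_i} \leq t\right) = \prod_{i=1}^n \Pr\left(u_i \geq \frac{|x_i|^p}{t}\right) = \prod_{i=1}^n e^{-|x_i|^p/t} = e^{-\|x\|_p^p/t},
\end{align*}
which is exactly the distribution of $\|x\|_p^p/u$ for a single $u \sim \mathrm{Exp}$. So $\max_i |x_i|/u_i^{1/p}$ is distributed as $\|x\|_p/u^{1/p}$, a constant-factor approximation to $\|x\|_p$ with constant probability; the matrix $P$ then lets us approximately track this maximum in small space.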
Our estimate of $\|x\|_p$ is:
\begin{align}
\|y\|_\infty = \max_{i = 1,\ldots,m}|y_i|
\end{align}
If we set:
\begin{align}
m = O\left(n^{1-\frac{2}{p}}\log^{O(1)}n\right)
\end{align}
Then, it is possible to guarantee that our estimate $\|y\|_\infty$ is in $[\frac{1}{4}\|x\|_p, 4\|x\|_p]$ with probability $\geq 2/3$.
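A small simulation of this estimator (toy parameters, chosen for illustration; we check empirically that $\|y\|_\infty$ lands within the claimed factor-4 window most of the time):

```python
import random

random.seed(7)
p = 3.0
n, m = 64, 32    # toy sizes; the analysis wants m ~ n^(1 - 2/p) * polylog(n)

x = [random.uniform(0.5, 1.0) for _ in range(n)]
true_norm = sum(abs(v) ** p for v in x) ** (1 / p)

def estimate():
    # D: diagonal entries 1 / u_i^{1/p} with u_i ~ Exp(1)
    z = [x[i] / random.expovariate(1.0) ** (1 / p) for i in range(n)]
    # P: each column holds a single +-1 in a uniformly random row
    y = [0.0] * m
    for i in range(n):
        y[random.randrange(m)] += random.choice([-1.0, 1.0]) * z[i]
    return max(abs(v) for v in y)   # ||y||_inf

trials = 200
hits = sum(true_norm / 4 <= estimate() <= 4 * true_norm for _ in range(trials))
print(hits, "of", trials, "trials within a factor of 4")
```

With these toy sizes a clear majority of trials should land in $[\frac{1}{4}\|x\|_p, 4\|x\|_p]$, consistent with the claimed constant success probability.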
\bibliographystyle{alpha}
\begin{thebibliography}{42}
\bibitem[AlonMS99]{AlonMS99}
Noga~Alon, Yossi~Matias, Mario~Szegedy.
\newblock The Space Complexity of Approximating the Frequency Moments.
\newblock {\em J. Comput. Syst. Sci.}, 58(1):137--147, 1999.
\bibitem[Andoni12]{Andoni12}
Alexandr~Andoni.
\newblock High frequency moments via max-stability.
\newblock Manuscript, 2012. \url{http://web.mit.edu/andoni/www/papers/fkStable.pdf}
\bibitem[AKO11]{AKO11}
Alexandr~Andoni, Robert~Krauthgamer, Krzysztof~Onak.
\newblock Streaming Algorithms via Precision Sampling.
\newblock {\em FOCS}, pgs.\ 363--372, 2011.
\bibitem[Indyk06]{Indyk06}
Piotr~Indyk.
\newblock Stable distributions, pseudorandom generators, embeddings, and data stream computation.
\newblock {\em J. ACM} 53(3): 307--323, 2006.
\bibitem[IW05]{IW05}
Piotr~Indyk, David~P.~Woodruff.
\newblock Optimal approximations of the frequency moments of data streams.
\newblock {\em STOC}, pgs.\ 202--208, 2005.
\bibitem[JST11]{JST11}
Hossein~Jowhari, Mert~Saglam, G\'{a}bor~Tardos.
\newblock Tight bounds for $L_p$ samplers, finding duplicates in streams, and related problems.
\newblock {\em PODS}, pgs.\ 49--58, 2011.
\bibitem[Nisan92]{Nisan92}
Noam~Nisan.
\newblock Pseudorandom Generators for Space-Bounded Computation.
\newblock {\em Combinatorica}, 12(4):449--461, 1992.
\end{thebibliography}
\end{document}