\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{framed}
\usepackage{cancel}
\usepackage{tabularx}
\DeclareMathOperator*{\E}{\mathbb{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbb{P}}
\newcommand{\eps}{\varepsilon}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf CS 229r: Algorithms for Big Data } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\begin{document}
\lecture{11--- Oct. 8, 2013}{Fall 2013}{Prof.\ Jelani Nelson}{Arthur Safira}
\section{Overview}
In this unit we have been focusing on dimensionality reduction with distortion as its figure of merit. In the last lecture we proved Lipschitz concentration and the decoupling lemma, and discussed Alon's lower bound for dimensionality reduction with small distortion ($m \gtrsim \frac{1}{\eps^2 \log(1/\eps)}\log N$).
In this lecture we will talk about more efficient dimensionality reduction --- more efficient in terms of \textbf{time}.
\subsection{Analyzing the time efficiency of JL}
The Johnson-Lindenstrauss Lemma \cite{JohnsonLindenstraussCM84} lets us transform high dimensional data to a lower dimension to run our algorithms faster:
$$ T(N, n) \overbrace{\rightarrow}^{\text{dim reduction}} T(N, O(\frac{1}{\eps^2} \log (N) ) ) + \text{Time to perform dim reduction}$$
where $T(N, n)$ denotes the time of the algorithm given $N$ vectors of dimension $n$. Our goal in this lecture will be to do the best we can choosing a $\Pi$ such that the time to perform the dimensionality reduction is minimized.
\section{Review of our previous choice(s) for $\Pi$}
Our previous $\Pi$ had the following:
\begin{itemize}
\item $\Pi \in \mathbb{R}^{m \times n}$
\item $\Pi_{i,j} = \frac{\pm 1}{\sqrt{m}}, \text{ each independent}$
\item $\text{time}(\Pi x) = O(mn)$
\begin{itemize}
\item More carefully, we can write the time complexity as $O(m \|x\|_0)$, where $\|x\|_0$ is the size of the support of $x$. This might seem nit-picky, but it is not uncommon for $x$ to be quite sparse. For example, some effective text processing machine learning algorithms keep vectors of dimension $D = $ number of words in the dictionary, and keep track of the frequency of words in an e-mail to discern whether or not a given message is spam.
\end{itemize}
\end{itemize}
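To make the cost concrete, here is a minimal dense-JL sketch in Python (the dimensions and the test vector are arbitrary illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 400                # original and target dimension (illustrative)

# Pi_{i,j} = +-1/sqrt(m), i.i.d. random signs: a dense m x n matrix
Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

x = rng.standard_normal(n)
y = Pi @ x                        # O(mn) work -- dense, even when x is sparse

# with high probability the norm is preserved up to small distortion
ratio = np.linalg.norm(y) / np.linalg.norm(x)
```

Even if $x$ had only a handful of non-zero entries, the dense multiply above would still touch every row of $\Pi$; exploiting sparsity of $x$ only brings the cost down to $O(m\|x\|_0)$, which is what we improve on below.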
In the previous lecture, although we proved JL with $\Pi$ as described above, we mentioned many other choices would work, too, such as one with independent random entries from $N(0,1)$ (we also gave some more general conditions on which probability distributions could be used for our matrices --- see the previous lecture for more details). None of these, however, help the above runtime.
\section{Let's find a better $\Pi$}
It's clear that one way we could win is by doing a better job choosing $\Pi$; it would be great if we could choose one with more $0$'s, so that we would be doing far fewer multiplications.
In this lecture and the next we will see two approaches for two different cases:
\begin{enumerate}
\item Choosing $\Pi$ to be fast for sparse vectors
\item Choosing $\Pi$ to be fast for dense vectors
\end{enumerate}
\subsection{What $\Pi$ is fast for sparse vectors?}
As just hinted at, the best way we know to deal with sparse vectors is to make $\Pi$ itself a sparse matrix.
\paragraph{History Table}
\begin{center}
\begin{tabularx}{\textwidth}{c | c | c | l}
\hline
\textbf{Reference} & \textbf{$m$} & \textbf{$s$} &\textbf{ Notes}\\ \hline \hline
JL \& others \cite{JohnsonLindenstraussCM84} & $\approx\frac{4}{\eps^2} \ln (1/\delta)$ & $m$ & Last Lecture \\ \hline
Achlioptas \cite{AchlioptasPODS01}& $\approx \frac{4}{\eps^2} \ln (1/\delta)$ & $m/3$ & Random Sign matrix with 2/3 \\
& & & prob of zero'ing each matrix \\ & & & element; $m/3$ is an expectation. \\ \hline
Thorup \& Zhang \cite{ThorupZhangSJC12} & $\frac{1}{\eps^2 \delta}$ & 1 & First Problem Set \\ \hline
Dasgupta, Kumar, \& Sarl\'{o}s \cite{DasGuptaKumarSarlosSTOC10} & $O( \frac{1}{\eps^2} \ln (1/\delta))$ & $\tilde{O}(\frac{1}{\eps} \log^2 (1/\delta))$\footnote{\cite{DasGuptaKumarSarlosSTOC10} showed $s = \tilde{O}(\eps^{-1}\log^3(1/\delta))$, but tighter analyses were later given of the same construction improving the cubic dependence to quadratic \cite{BOR10,KN10}.} & $\tilde{O}(\cdot)$ hides $\log^{O(1)}(1/\eps)$\\ \hline
Kane, Nelson \cite{KaneNelsonSODA12}& $O( \frac{1}{\eps^2} \ln (1/\delta))$ & $O(\frac{1}{\eps} \log(1/\delta))$ &\\ \hline
\end{tabularx}
\end{center}
where $m$ is the target dimension and $s$ is the number of non-zero entries per column of the matrix. With $s$ non-zero entries per column, we can compute $\Pi x$ in time $\text{time}(\Pi x) = O(s \|x\|_0)$.
What is $\Pi$? We have two options:
\begin{enumerate}
\item $\Pi_1$: Split each column of $\Pi$ into $s$ blocks of size $m/s$. For each of these blocks, choose exactly one entry to be a (normalized) random sign ($\sigma = \pm 1 /\sqrt{s}$), and set the rest of the entries in the block to $0$.
\item $\Pi_2$: For each column, choose $s$ entries (\textbf{without replacement}) to place a random sign r.v. (again, $\sigma = \pm 1/ \sqrt{s}$).
\end{enumerate}
As far as implementations are concerned, the first of these is a bit simpler as we can simply make use of hash functions $h: [n] \times [s] \to [m/s]$ and $\sigma: [n] \times [s] \to \{ \pm 1 \}$. Dealing with the second one is more of a hassle.
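A minimal Python sketch of how $\Pi_1$ would be applied, with the hash functions $h$ and $\sigma$ simulated by explicit random tables (sizes are illustrative; a real implementation would use a small-seed hash family instead):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 1000, 200, 10            # illustrative sizes; s divides m
block = m // s                     # each column has s blocks of length m/s

# h[j, k]: position of the single non-zero in block k of column j
h = rng.integers(0, block, size=(n, s))
# sigma[j, k]: the random sign for that entry
sigma = rng.choice([-1.0, 1.0], size=(n, s))

def apply_pi(x):
    """Compute Pi_1 @ x in O(s * nnz(x)) time by streaming over the support."""
    y = np.zeros(m)
    for j in np.nonzero(x)[0]:     # only the non-zero coordinates of x matter
        for k in range(s):
            y[k * block + h[j, k]] += sigma[j, k] * x[j] / np.sqrt(s)
    return y
```

Each coordinate of $x$ touches exactly $s$ rows of the output, which is exactly the claimed $O(s\|x\|_0)$ running time.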
\subsection{Dealing with $\Pi$}
From here onwards, we drop the subscript and simply write $\Pi$.
Quick Notes:
\begin{itemize}
\item For $s = m$, both $\Pi_1$ and $\Pi_2$ are the Thorup-Zhang sketch \cite{ThorupZhangSJC12}.
\item For general $s$, $\Pi_1$ is just CountSketch: the matrix describes the hash functions of a particular CountSketch matrix with $s$ rows and $m/s$ columns \cite{CharikarChenFarach-ColtonTCS04}.
\end{itemize}
\subsubsection{Analysis}
\begin{framed}
\textbf{Claim}: We can set $m$ and $s$ such that
$$m = O(1/\eps^2 \log(1/\delta)) \; \text{ and } \; s = O(\frac{1}{\eps} \log (1/\delta))$$
and satisfy the usual $(1 \pm \eps)$ distortion properties of the mapping $\Pi$ w.p.\ $1-\delta$.
\end{framed}
Before we prove this claim, we mention some bad news in regard to our efforts here.
\begin{framed}
\textbf{``Bad News'' Claim}: For all $n > 1$ there exist $N = n+1$ vectors in $\R^n$ such that any $\Pi \in \mathbb{R}^{m \times n}$ preserving all their pairwise Euclidean distances up to $1+\eps$, with $s$ non-zeros per column, must have $s = \Omega( \frac{1}{\eps} \cdot \frac{\log N}{\log (1/\eps)})$ as long as $m = O(\eps^{-2}\log N)$ (and $m = O(\frac{n}{\log(1/\eps)})$) \cite{NelsonNguyenCoRR12}. Note we can't let $m$ get too close to $n$ in the lower bound, since once $m=n$ the identity matrix works and has $s=1$.
\end{framed}
In other words, there is a limit to how sparse we can make $\Pi$ if we also want to reduce the dimension all the way down to $m = O(\frac{1}{\eps^2} \log N)$.
On the bright side, let's move towards proving the claim.
\subsubsection{Proof of Claim}
Note we can write the elements of $\Pi$ as $\Pi_{i,j} = \frac{1}{\sqrt{s}} \delta_{i,j} \sigma_{i,j}$ where $\sigma_{i,j}$ is a random sign and $\delta_{i,j}\in\{0,1\}$. Without loss of generality, we can set $\|x\|_2 = 1$. Then, each component of $\Pi x$ is given by
\begin{align*}
(\Pi x)_r & = \frac{1}{\sqrt{s}} \sum_{i = 1}^n \delta_{r, i} \sigma_{r, i} x_i .
\end{align*}
Thus, we can write the norm squared as
\begin{align*}
\|\Pi x\|^2 & = \frac{1}{s} \sum_{r = 1}^m \sum_{i,j = 1}^n \delta_{r, i} \delta_{r, j} \sigma_{r, i} \sigma_{r, j} x_i x_j \\
& = \frac{1}{s} \sum_{r = 1}^m \left ( \sum_{i = 1}^n \delta_{r, i}^2 \cancel{ \sigma_{r, i}^2} x_i^2 +\sum_{\substack{i,j = 1 \\ i \neq j }}^n \delta_{r, i} \delta_{r,j} \sigma_{r, i} \sigma_{r, j} x_i x_j \right ) \\
\|\Pi x\|^2 - 1 & = \overbrace{\frac{1}{s} \sum_{r = 1}^m \sum_{\substack{i,j = 1 \\ i \neq j }}^n \delta_{r, i} \delta_{r,j} \sigma_{r, i} \sigma_{r, j} x_i x_j }^{Z}
\end{align*}
Where we used the fact that
$$ \frac{1}{s} \sum_{r =1}^m \sum_{i = 1}^n \delta^2_{r,i} x_i^2 = \frac{1}{s} \sum_{i = 1}^n \sum_{r =1}^m \delta^2_{r,i} x_i^2 = \frac{1}{\cancel{s}} \sum_{i = 1}^n \cancel{s} x_i^2 = 1$$
since $\|x\| = 1$ and we have that there are exactly $s$ non-zero elements per column.
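The fact that the diagonal term contributes exactly $1$ (equivalently, $\E\|\Pi x\|^2 = \|x\|^2$) is easy to sanity-check numerically with the block construction; the sizes below are arbitrary small choices for the check:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s, trials = 50, 20, 4, 2000
block = m // s                     # block length m/s
x = rng.standard_normal(n)
x /= np.linalg.norm(x)             # WLOG ||x||_2 = 1

vals = []
for _ in range(trials):
    # draw a fresh Pi from the block construction each trial
    h = rng.integers(0, block, size=(n, s))
    sigma = rng.choice([-1.0, 1.0], size=(n, s))
    Pi = np.zeros((m, n))
    for j in range(n):
        for k in range(s):
            Pi[k * block + h[j, k], j] = sigma[j, k] / np.sqrt(s)
    vals.append(np.linalg.norm(Pi @ x) ** 2)

mean_sq = float(np.mean(vals))     # should concentrate around 1
```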
Note that at this point we have reduced the problem to characterizing $Z$. We can interpret $Z$ as the error in how far $\|\Pi x\|^2$ is from 1. We would like to show something like
$$ \Pr( |Z| > \eps) < \delta . $$
Let's write
$$ \|\Pi x\|^2 - 1 = \sigma^T A_x \sigma - \mathbb{E}(\sigma^T A_x \sigma) .$$
Note that $\mathbb{E}(\sigma^T A_x \sigma)$ will be exactly the sum of the diagonal terms (convince yourself!). This situation should look really familiar now; we had a similar (but not exactly the same!) situation last lecture. In the last lecture, $A_x$ was a block diagonal matrix with submatrices $x x^T$ along the diagonal. We \emph{almost} have this, but since so many of our entries are actually $0$, we instead have diagonal blocks of the form $x^{(r)} x^{(r) T}$ (but with the diagonal entries of each block zeroed out), with
$$ x^{(r)} = (\delta_{r, 1} x_1, ... , \delta_{r, n} x_n )^T $$
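Explicitly, in matrix form,
$$ A_x = \frac{1}{s} \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_m \end{pmatrix}, \qquad A_r = x^{(r)} x^{(r) T} - \mathrm{diag}\left(\delta_{r,1} x_1^2, \ldots, \delta_{r,n} x_n^2 \right), $$
where subtracting the diagonal matrix is exactly what zeroes out the diagonal entries of $x^{(r)} x^{(r)T}$, since $(x^{(r)})_i^2 = \delta_{r,i} x_i^2$.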
How are we going to prove the tail bound on $\| \Pi x\|^2 - 1$? The same way as we did it last lecture, using the Hanson-Wright inequality:
$$ \Pr ( | \| \Pi x\|^2 - 1 | > \eps) = \Pr( | \sigma^T A_x \sigma - \mathbb{E} \sigma^T A_x \sigma | > \eps) \lesssim \exp \left ( - \min \left \{ \frac{c \eps^2}{\|A_x\|_F^2} , \frac{c \eps}{\|A_x\|} \right \} \right ) $$
Let's define a ``good'' event $E$: for all $i \neq j \in [n]$, $\sum_{r=1}^m \delta_{r,i} \delta_{r,j} = O(s^2/m)$. This quantity counts the number of times two distinct columns ``collide'': if we place $s$ non-zero elements in each of columns $i$ and $j$, then each non-zero of column $i$ collides with one of column $j$ with probability $s/m$, and there are $s$ such opportunities, so the expected number of collisions is $O(s^2/m)$. We expect $E$ to hold, but our proof will actually depend on it holding; as such, we will later bound the probability of having more than (for example) $5$ times this expected number of collisions and include that in our analysis.
With the good event as above, we can write
\begin{align*}
\|A_x\|_F^2 & = \frac{1}{s^2} \sum_{r = 1}^m \sum_{i \neq j} \delta_{r, i} \delta_{r, j} x_i^2 x_j^2 \\
& = \frac{1}{s^2} \sum_{i \neq j} x_i^2 x_j^2 \sum_{r = 1}^m \delta_{r, i} \delta_{r, j} \\
&\leq O(\cancel{s^2}/m)\frac{1}{\cancel{s^2}} \overbrace{\sum_{i \neq j} x_i^2 x_j^2}^{\leq (\sum x_i^2)^2 = 1} \\
&\leq O(\frac{1}{m})
\end{align*}
Whew. Let's consider the operator norm now, $\|A_x\|$. The matrix $A_x$ has the form of a block diagonal matrix; we can write
$$\|A_x\| = \frac{1}{s} \max_{r \in [m]} \|A_r\|.$$
where we pulled out the overall factor of $1/s$. We can bound this by noticing $A_r = S_r - D_r$, with $S_r = x^{(r)}x^{(r)T}$ and $D_r$ diagonal with entries $\delta_{r, 1} x_1^2, \ldots, \delta_{r, n} x_n^2$. We can write
$$ \overbrace{\|A_r \| \leq \|S_r\| + \|D_r\|}^{\text{triangle ineq.}} \leq 1 + 1 = 2 $$
since
$$ \|D_r\| = \max_i \delta_{r,i} x_i^2 \leq \|x\|_\infty^2 \leq 1$$
and $\|S_r\|$ has only one eigenvector with non-zero eigenvalue, namely $x^{(r)}$, so that $\|S_r\| = \|x^{(r)}\|^2 \leq \|x\|^2 = 1$. Thus, $\|A_x\| \leq 2/s$.
Putting this all together, we can write that, \textbf{conditioned on $E$},
$$ \Pr_\sigma ( | \| \Pi x\|^2 -1 | > \eps ) \leq \max \left \{ \exp(- c \eps^2 m), \exp (- c \eps s/2) \right \}$$
Thus we can choose $m = \Theta ( \frac{1}{\eps^2} \log (1/\delta))$ and $s = \Theta ( \frac{1}{\eps} \log (1/\delta) )$. Remember, however, that we aren't done; we need to deal with the event we conditioned on in order to get the bound on $m$!
\subsubsection{Analysis of the event $E$}
Fix $(i,j)$ and suppose $\delta_{r_1, i}, ..., \delta_{r_s, i} = 1$. Define indicator r.v.'s $X_1, ..., X_s$, with
$$ X_k = \mathbf{1}_{\{\delta_{r_k, j} = 1 \} } .$$
Then,
$$ \sum_{r = 1}^m \delta_{r, i} \delta_{r, j} = \sum_{k = 1}^s X_k $$
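This collision count is easy to simulate for a single fixed pair $(i,j)$ under the block construction (this is only a sanity check of the $s^2/m$ heuristic, not part of the proof; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m, s, trials = 200, 20, 5000
block = m // s                      # block length m/s

counts = []
for _ in range(trials):
    # row positions of the s non-zeros in columns i and j (one per block)
    rows_i = np.arange(s) * block + rng.integers(0, block, size=s)
    rows_j = np.arange(s) * block + rng.integers(0, block, size=s)
    counts.append(np.intersect1d(rows_i, rows_j).size)

avg = float(np.mean(counts))        # expectation is s^2/m = 2 here
```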
Now we can apply the Chernoff bound to get error probability $\gamma = \delta/n^2$ as long as $s^2/m \geq c \log (1/\gamma)$, or
$$ s \gtrsim \sqrt{m \log(n/\delta) } = \frac{1}{\eps} \sqrt{ \log(1/\delta) \log(n/\delta) } $$
This setting of $\gamma$ ensures, by a union bound over the $\binom{n}{2}$ pairs $i\neq j$, that no pair has too many collisions. Finally, note that we can first apply the Thorup-Zhang sketch with $s=1$ and $m = O(1/(\eps^2 \delta))$, and then apply the $\Pi$ we have been describing all along to the result; the effective $\Pi$ is the product of the two matrices, which lets us replace $n$ by $O(1/(\eps^2 \delta))$ in the bound above. To remove the remaining $\sqrt{\log(1/\eps)}$ factor, we instead bound the moments of $Z$ directly. Write
$$ \|\Pi x\|^2 -1 = Z = \frac{1}{s} \sum_{r = 1}^m \sum_{i \neq j} \delta_{r,i} \delta_{r,j} \sigma_{r,i} \sigma_{r, j} x_i x_j = \frac{1}{s} \sum_{r = 1}^m Z_r$$
We then bound the probability that $Z$ is large: for $\ell$ an even integer,
\begin{align*}
\Pr(|Z| > \eps ) & = \Pr(Z^\ell > \eps^\ell ) \\
& \leq \frac{1}{\eps^\ell} \mathbb{E} (Z^\ell) \\
& = \frac{1}{\eps^\ell s^\ell} \sum_{q = 1}^\ell \sum_{\substack{r_1, \ldots, r_q \\ \ell_1, \ldots, \ell_q \\ \sum \ell_i = \ell}} \binom{\ell}{\ell_1, \ldots, \ell_q} \mathbb{E} \left ( \prod_{j = 1}^q Z_{r_j}^{\ell_j} \right )
\end{align*}
where we simply expanded all of the terms of $Z^\ell = ( (1/s) \sum_r Z_r)^\ell$ and used linearity of expectation. Without going into all the details, it turns out we can write $\mathbb{E} ( \prod_{j = 1}^q Z_{r_j}^{\ell_j} ) \leq \prod_{j=1}^q \mathbb{E} (Z_{r_j}^{\ell_j})$, and the proof boils down to bounding this last expectation.
\section{Moral}
One moral of this story is that in a proof, you want to argue as much as you can without using too many ``black boxes''. In our proof here, the Hanson-Wright inequality is a black box that we used to bound our error. In the proof of this last part, the bounds on the expectation were proven from first principles \cite{KaneNelsonSODA12}. In fact it is possible to get the correct bound using Hanson-Wright, but then you should not condition on the event $E$; rather, you should use Markov's inequality on a high moment to show that the Frobenius norm squared $\|A_x\|_F^2$ is small with high probability. The calculations you end up doing, however, become essentially identical to reasoning about the moments of $Z$ itself from first principles.
If a paper uses a lot of ``black boxes'' to reach its results, it's not unreasonable to suspect that there might be a better result to be had by avoiding such heavy machinery and arguing from first principles. Or at least, one should understand when each of the black boxes is or isn't tight in a given application (maybe something too powerful is being used, causing suboptimal results).
\bibliographystyle{alpha}
\begin{thebibliography}{42}
\bibitem{AchlioptasPODS01}
Dimitris~Achlioptas.
\newblock Database-friendly random projections.
\newblock {\em J. Comput. Syst. Sci.} 66(4): 671--687, 2003.
\bibitem{BOR10}
Vladimir~Braverman, Rafail~Ostrovsky, Yuval~Rabani.
\newblock Rademacher Chaos, Random Eulerian Graphs and The Sparse Johnson-Lindenstrauss Transform.
\newblock {\em CoRR} abs/1011.2590, 2010.
\bibitem{CharikarChenFarach-ColtonTCS04}
Moses~Charikar, Kevin~Chen, and Martin~Farach-Colton.
\newblock Finding frequent items in data streams.
\newblock {\em Theor. Comput. Sci., 312} (1):3--15, 2004.
\bibitem{DasGuptaKumarSarlosSTOC10}
Anirban~Dasgupta, Ravi~Kumar, Tam\'{a}s~Sarl\'{o}s.
\newblock A sparse Johnson-Lindenstrauss transform.
\newblock {\em STOC}, pgs.\ 341--350, 2010.
\bibitem{JohnsonLindenstraussCM84}
William~B.~Johnson, Joram~Lindenstrauss.
\newblock Extensions of Lipschitz mappings into a Hilbert space.
\newblock Contemporary Mathematics 26, 1984.
\bibitem{KN10}
Daniel~M.~Kane, Jelani~Nelson.
\newblock A Derandomized Sparse Johnson-Lindenstrauss Transform.
\newblock {\em CoRR} abs/1006.3585, 2010.
\bibitem{KaneNelsonSODA12}
Daniel~M.~Kane, Jelani~Nelson.
\newblock Sparser Johnson-Lindenstrauss transforms.
\newblock {\em SODA}, pgs. 1195--1206, 2012.
\bibitem{NelsonNguyenCoRR12}
Jelani~Nelson, Huy~L.~Nguy$\tilde{\hat{\mbox{e}}}$n.
\newblock Sparsity Lower Bounds for Dimensionality Reducing Maps.
\newblock {\em STOC}, pgs.\ 101--110, 2013.
\bibitem{ThorupZhangSJC12}
Mikkel~Thorup, Yin~Zhang.
\newblock Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation.
\newblock { \em SIAM J. Comput.} 41(2): 293--331, 2012.
\end{thebibliography}
\end{document}