\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\DeclareMathOperator*{\E}{\mathbb{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbb{P}}
\newcommand{\eps}{\varepsilon}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf CS 229r: Algorithms for Big Data } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{conjecture}[theorem]{Conjecture}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\begin{document}
\lecture{Oct. 10, 2013}{Fall 2013}{Prof.\ Jelani Nelson}{Tom Morgan}
\section{Overview}
In this lecture and the previous one, we focus on finding Johnson-Lindenstrauss matrices that allow for faster multiplication. Last lecture we presented a construction that is faster when the vectors to be transformed are sparse; this lecture we present a JL matrix that supports fast multiplication even for dense vectors.
\section{Fast Johnson-Lindenstrauss Transform}
Here we will present the Fast Johnson-Lindenstrauss Transform of \cite{ac06}. The initial idea is to use a scaled sampling matrix $S\in \R^{m \times n}$: $S$ is all zeros except for a $\sqrt{n/m}$ in a uniformly random column of each row. Observe that $\|Sx\|_2^2$ is a scaled empirical average of $m$ randomly sampled $x_i^2$'s, which has the desired expectation since, if we select $r \in [n]$ uniformly at random, $$\E_r x_r^2=\sum_{i=1}^n\Pr(r=i)\cdot x_i^2=\frac{1}{n}\|x\|_2^2.$$ The problem, however, is that this estimator may have high variance. For example, if most of $x$'s mass lies in a single coordinate, then $m$ needs to be on the order of $n$ before we can expect to sample it. Our solution is to precondition $x$ with another matrix that preserves norms while spreading $x$'s mass across the coordinates.
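To see the variance problem concretely, here is a small simulation (a Python sketch, not from the lecture; the function name and parameters are my own) comparing the sampling estimator on a well-spread unit vector and on a maximally concentrated one:

```python
import random

def sample_estimate(x, m, rng):
    """||Sx||_2^2 for a random S: each of the m rows has a single
    sqrt(n/m) entry in a uniformly random column."""
    n = len(x)
    return (n / m) * sum(x[rng.randrange(n)] ** 2 for _ in range(m))

rng = random.Random(0)
n, m = 1024, 64

spread = [1 / n ** 0.5] * n           # well-spread unit vector
spiky = [0.0] * n
spiky[0] = 1.0                        # all mass in one coordinate

est_spread = [sample_estimate(spread, m, rng) for _ in range(200)]
est_spiky = [sample_estimate(spiky, m, rng) for _ in range(200)]
# The spread vector's norm is estimated exactly every time; the spiky
# vector's is usually estimated as 0, since rows rarely hit coordinate 0.
```

With these parameters a row hits the spiky coordinate with probability $1/n$, so most trials miss it entirely and report a squared norm of 0.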
The inspiration of \cite{ac06} for this preconditioner comes from Heisenberg's uncertainty principle, which states that a vector $x$ and its Fourier transform $\hat{x}$ cannot both be sharply concentrated. However, we cannot simply take the Fourier transform, since if the initial $x$ is well spread, $\hat{x}$ may become concentrated. The final matrix of \cite{ac06} was $\Pi=PHD$, where $D$ is a diagonal matrix of random signs, $H$ is the Fourier matrix, and $P$ is a random super-sparse matrix. With this they achieved a multiplication time between $\Pi$ and $x$ of $O\left(n\log n + \frac{1}{\epsilon^2}\log^2\left(\frac{1}{\delta}\right)\log\left(\frac{n}{\delta}\right)\right)$. We will prove a slightly weaker version of this result by letting $P$ be a slow JL matrix times the sampling matrix $S$ described above.
\subsection{Main Results}
There are many possibilities for the matrix $H$, other than just the Fourier matrix. It suffices for $H$ to satisfy the following properties:
\begin{enumerate}
\item $Hx$ can be computed in $O(n \log n)$ time.
\item $\forall i,j \; |H_{i,j}|=1/\sqrt{n}$
\item $H$ is an orthogonal matrix ($HH^T = H^T H = I$)
\end{enumerate}
Instead of the Fourier matrix, we will use $H$ such that $H_{i,j}=\frac{(-1)^{\inprod{i,j}}}{\sqrt{n}}$, where $\inprod{i,j}$ is the dot product of the binary representations of $i$ and $j$. We will require that $n$ is a power of 2 (if it is not, simply pad $x$ with 0s until it is). This $H$ clearly satisfies property 2. To see that it satisfies property 1, observe that if $H_k$ is the $H$ matrix when $n = k$, then
$$H_k=\frac{1}{\sqrt{2}}\left[
\begin{array}{c|c}
H_{k/2} & H_{k/2} \\ \hline
H_{k/2} & -H_{k/2} \\
\end{array}\right].$$
This follows from looking at how the most significant bits of $i$ and $j$ contribute to $\inprod{i,j}$ and thus to $H_{i,j}$. Let $i'$ and $j'$ be $i$ and $j$ with the most significant bit removed. In all but the bottom-right quadrant, either $i$ or $j$ has a 0 as its most significant bit, so $\inprod{i,j}=\inprod{i',j'}$, while in the bottom-right quadrant $i$ and $j$ both have a leading 1, so $\inprod{i,j}=1+\inprod{i',j'}$. Using this recurrence, we can compute $Hx$ in time $O(n\log n)$.
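The recurrence translates directly into a divide-and-conquer algorithm. Below is a minimal Python sketch (illustrative only; the function name is my own) computing $Hx$ in $O(n\log n)$ time:

```python
from math import sqrt

def hadamard_transform(x):
    """Compute Hx for H_{i,j} = (-1)^{<i,j>}/sqrt(n) in O(n log n) time,
    via the recurrence H_k = (1/sqrt 2) [[H_{k/2}, H_{k/2}],
                                         [H_{k/2}, -H_{k/2}]].
    Requires len(x) to be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    top = hadamard_transform(x[:half])   # H_{k/2} applied to the upper half
    bot = hadamard_transform(x[half:])   # H_{k/2} applied to the lower half
    s = 1 / sqrt(2)
    return ([s * (a + b) for a, b in zip(top, bot)] +
            [s * (a - b) for a, b in zip(top, bot)])
```

For example, `hadamard_transform([1, 0, 0, 0])` returns the first column of $H_4$, i.e. approximately `[0.5, 0.5, 0.5, 0.5]`; note also that the output always has the same $\ell_2$ norm as the input, by orthogonality.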
To see that this $H$ satisfies property 3, we will verify that $H^TH=I$. Write
$$H=\left[
\begin{array}{ccc}
| & & | \\
h^1 & \cdots & h^n \\
| & & | \\
\end{array}\right].$$ Then $$(H^TH)_{ii}=\inprod{h^i,h^i}=\frac{1}{n}\sum_{r=1}^n\left((-1)^{\inprod{i,r}}\right)^2=\frac{1}{n}\sum_{r=1}^n 1 = 1$$ and for $i \neq j$, $$(H^TH)_{ij}=\inprod{h^i,h^j}=\frac{1}{n}\sum_{r=1}^n(-1)^{\inprod{i,r}+\inprod{j,r}}=\frac{1}{n}\sum_{r=1}^n(-1)^{\inprod{i\oplus j,r}}=0,$$ where $\oplus$ is bitwise XOR (note that $\inprod{i,r}+\inprod{j,r}$ and $\inprod{i\oplus j,r}$ have the same parity). The final equality follows since $i$ and $j$ differ in at least one bit, so $i\oplus j \neq 0$, and thus exactly as many $r$s make $\inprod{i\oplus j,r}$ even as make it odd.
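Properties 2 and 3 can also be checked numerically for small $n$. The following brute-force Python sketch (illustrative; not part of the notes) builds $H$ explicitly from its definition and verifies both:

```python
from math import sqrt

def dot_bits(i, j):
    """<i,j>: dot product of binary representations, i.e. popcount(i & j)."""
    return bin(i & j).count("1")

def build_H(n):
    """The n x n matrix with H_{i,j} = (-1)^{<i,j>}/sqrt(n) (n a power of 2)."""
    return [[(-1) ** dot_bits(i, j) / sqrt(n) for j in range(n)]
            for i in range(n)]

n = 8
H = build_H(n)
# Property 2: every entry has magnitude 1/sqrt(n).
assert all(abs(abs(H[i][j]) - 1 / sqrt(n)) < 1e-12
           for i in range(n) for j in range(n))
# Property 3: columns are orthonormal, i.e. H^T H = I.
for i in range(n):
    for j in range(n):
        ip = sum(H[r][i] * H[r][j] for r in range(n))
        assert abs(ip - (1.0 if i == j else 0.0)) < 1e-12
```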
Now we will move on to proving that $\Pi$ operates as desired. First, we will show that $HD$ successfully spreads the mass of $x$. Let $y = HDx$ and $z = Sy$.
\begin{lemma}
$\Pr_D\left(\|y\|_\infty \geq c \sqrt{\frac{\ln(2n/\delta)}{n}}\right) < \frac{\delta}{2}$.
\end{lemma}
\begin{proof}
Let $\alpha_i = \pm 1$ be the $i$th diagonal entry of $D$, and assume without loss of generality that $\|x\|_2=1$. We will begin by bounding the $p$th moment of $y_i$ so that we can apply Markov's inequality to it. We will do this, once again, by inserting random Gaussians $g_1,\ldots,g_n$:
\begin{align*}
\|y_i\|_p &= \left|\left|\sum_{r=1}^n \alpha_r (h_{ir}x_r)\right|\right|_p \\
&= \sqrt{\frac{\pi}{2}}\left|\left|\E_g \sum_{r=1}^n\alpha_r |g_r|(h_{ir}x_r)\right|\right|_p \\
&\leq \sqrt{\frac{\pi}{2}}\left|\left|\sum_{r=1}^n\alpha_r |g_r|(h_{ir}x_r)\right|\right|_p \\
&= \sqrt{\frac{\pi}{2}}\left|\left|\sum_{r=1}^n g_r(h_{ir}x_r)\right|\right|_p \\
&= \sqrt{\frac{\pi}{2}}\sqrt{\sum_r(h_{ir}x_r)^2}\;\|g\|_p \\
&\lesssim \sqrt{\frac{p}{n}}\|x\|_2,
\end{align*}
where the first equality uses $\E|g_r|=\sqrt{2/\pi}$, the inequality follows from Jensen's inequality, the next equality holds because $\alpha_r|g_r|$ has the same distribution as $g_r$, the following equality uses the 2-stability of Gaussians (with $g$ denoting a single standard Gaussian), and the final bound uses $\|g\|_p\lesssim\sqrt{p}$ together with $\sum_r(h_{ir}x_r)^2=\|x\|_2^2/n$. As an aside, the property that for random signs $\sigma_i$, $$\left|\left|\sum_i\sigma_i x_i\right|\right|_p \lesssim \sqrt{p} \|x\|_2$$ is known as Khintchine's inequality.
We can now apply Markov's inequality generalized to the $p$th moment, $$\Pr(|y_i|>\lambda)<\lambda^{-p}\E |y_i|^p,$$ obtaining
$$
\Pr\left(|y_i|>c\sqrt{\frac{\ln(2n/\delta)}{n}}\right) < \left(\sqrt{\frac{p}{n}}\cdot\frac{1}{c}\sqrt{\frac{n}{\ln(2n/\delta)}}\right)^p = \left(\sqrt{\frac{p}{c^2\ln(2n/\delta)}}\right)^p.
$$
Setting $p = c^2 \ln(2n/\delta)/e^2$ and $c = e$, this is bounded by $$e^{-p}=e^{-\frac{c^2}{e^2}\ln(2n/\delta)}=\frac{\delta}{2n},$$ and a union bound over the $n$ coordinates proves the claim.
\end{proof}
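As a sanity check on this lemma (a simulation sketch, not from the lecture; names are my own), note that the random signs in $D$ are essential: applying $H$ alone to a flat vector re-concentrates all of its mass into a single coordinate, while $HD$ keeps it spread:

```python
import random
from math import sqrt

def apply_HD(x, signs):
    """y = HDx, where H_{i,j} = (-1)^{<i,j>}/sqrt(n) and D = diag(signs).
    Brute-force O(n^2); fine for a small demonstration."""
    n = len(x)
    dx = [s * v for s, v in zip(signs, x)]
    return [sum((-1) ** bin(i & j).count("1") * dx[j] for j in range(n)) / sqrt(n)
            for i in range(n)]

rng = random.Random(1)
n = 128
x = [1 / sqrt(n)] * n                 # flat unit vector: worst case for H alone

# Without the random signs, H re-concentrates this x completely into e_0:
no_D = apply_HD(x, [1] * n)           # max entry is 1
# With random signs, the mass stays spread at the ~sqrt(log(n)/n) scale:
signs = [rng.choice((-1, 1)) for _ in range(n)]
y = apply_HD(x, signs)
peak = max(abs(v) for v in y)         # much smaller than 1
```

With $n=128$, `peak` typically lands near the $\sqrt{2\ln n}/\sqrt{n}\approx 0.28$ scale, far below the value 1 obtained without $D$.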
Now we know that $HD$ spreads the mass of $x$ as desired, and it is easy to see that $\|y\|_2=\|x\|_2$, since $H$ and $D$ are both orthogonal. All that remains is to argue that $S$ preserves the norms of unit vectors $y$.
\begin{lemma}
Conditioned on the ``good'' event that $\|y\|_\infty \lesssim \sqrt{\frac{\ln(2n/\delta)}{n}}$, and assuming $\|y\|_2=1$, $\Pr\left(\left|\|Sy\|_2^2-1\right|\leq\epsilon\right) \geq 1-\delta/2$.
\end{lemma}
\begin{proof}
We will be using the following Chernoff bound. Given independent random variables $X_1,\ldots,X_n$, $X=\sum_i X_i$, $\mu =\E X$, $\sigma^2=\E(X-\E X)^2$, and $|X_i| \leq K$ with probability $1$, $$\Pr(|X-\mu|>\lambda) \lesssim \max\left\{e^{-c\lambda^2/\sigma^2},e^{-c\lambda/K}\right\}.$$
We will write $S_{ri}=\sqrt{n/m}\,\delta_{ri}$, where $\delta_{ri}\in\{0,1\}$ indicates whether row $r$ sampled column $i$; thus $\forall r,\ \sum_{i=1}^n \delta_{ri} = 1$. Recall that $z=Sy$. Let
\begin{align*}
q_r&=z_r^2\\
&=\frac{n}{m}\left(\sum_{i=1}^n \delta_{ri}y_i\right)^2\\
&=\frac{n}{m}\left(\sum_i \delta_{ri}y_i^2 + \sum_{i \neq j} \delta_{ri}\delta_{rj}y_i y_j\right)\\
&=\frac{n}{m}\sum_i \delta_{ri}y_i^2.\\
\end{align*}
We care about $\|z\|_2^2=\sum_r q_r$, and the $q_r$s are independent, so we can apply the Chernoff bound once we bound $\sigma^2$ and $K$. Note that $\mu=\E\sum_r q_r=\frac{n}{m}\sum_{r=1}^m\sum_i \frac{1}{n}y_i^2=\|y\|_2^2=1$. For $K$, $$K\leq \frac{n}{m}\|y\|_\infty^2\lesssim \frac{\ln(n/\delta)}{m}.$$
\begin{align*}
\sigma^2 &\le m \E q_1^2 = m \E\left(\frac{n^2}{m^2}\sum_i \delta_{1i} y_i^4\right) \\
&= \frac{n}{m} \sum_i y_i^4 \\
&\leq \|y\|_\infty^2\, \frac{n}{m}\sum_i y_i^2 \\
&= \frac{n}{m} \|y\|_\infty^2 \\
&\lesssim \frac{\ln(n/\delta)}{m}.
\end{align*}
Plugging this into the Chernoff bound we get
$$\Pr\left(\left|\|Sy\|_2^2-1\right|>\epsilon\right)=\Pr\left(\left|\sum_r q_r-1\right|>\epsilon\right) \lesssim \max\left\{e^{-c\epsilon^2m/\ln(n/\delta)},e^{-c\epsilon m/\ln(n/\delta)}\right\}.$$
For $\epsilon<1$ the first term dominates, and so we set $m \approx \frac{1}{\epsilon^2}\ln(n/\delta)\ln(1/\delta)$ to get the $\delta/2$ bound that we desire.
\end{proof}
Note that our $m$ here is a bit larger than the desired JL dimension, so after applying $SHD$ we apply a final matrix $T$, a normalized random sign matrix with $O(\epsilon^{-2} \log \delta^{-1})$ rows (the same kind we used for slow JL), to fix this up. Thus our final JL matrix is $\Pi=TSHD$. Now consider the time to compute $\Pi x$: multiplying by $D$ takes $O(n)$ time, by $H$ takes $O(n\log n)$, by $S$ takes $O(m)=O(n)$ time, and by $T$ takes $O\left(\frac{1}{\epsilon^2}\log\left(\frac{1}{\delta}\right)\cdot\frac{1}{\epsilon^2}\log\left(\frac{1}{\delta}\right)\log\left(\frac{n}{\delta}\right)\right)\approx O(m^3)$ time. So the total time is $O(n \log n + m^3)$.
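Putting the pieces together, here is a sketch of the pipeline in Python (names are my own; the final matrix $T$ is omitted for simplicity, and $H$ is applied via an iterative fast Walsh-Hadamard transform). The map $x \mapsto SHDx$ now handles exactly the spiky vectors that defeated plain sampling:

```python
import random
from math import sqrt

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (iterative, O(n log n))."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return [v / sqrt(n) for v in x]

def fast_jl_no_T(x, m, rng):
    """z = SHDx: random signs (D), Hadamard transform (H), then sample m
    coordinates scaled by sqrt(n/m) (S). Omits the final matrix T."""
    n = len(x)
    dx = [rng.choice((-1, 1)) * v for v in x]
    y = fwht(dx)
    return [sqrt(n / m) * y[rng.randrange(n)] for _ in range(m)]

rng = random.Random(2)
n, m = 1024, 256
x = [0.0] * n
x[0] = 1.0                            # the spiky input that plain sampling misses
z = fast_jl_no_T(x, m, rng)
est = sum(v * v for v in z)           # close to ||x||_2^2 = 1
```

For this $x$, $HDx$ has all entries of magnitude $1/\sqrt{n}$, so every sampled coordinate contributes equally and the squared-norm estimate is exact.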
\subsection{Later Work}
Later work focused on improving the additive $O(m^3)$ term in the running time. \cite{al09} reduced it to $O_\gamma(m^{2+\gamma})$, and \cite{al11} and \cite{kw11} further reduced it to $O(m)$. They achieved this using the same $\Pi=SHD$ but with a better analysis: rather than showing that $HD$ spreads mass, they focused on $SH$, which they showed satisfies the restricted isometry property (RIP), meaning that it approximately preserves the norms of all $k$-sparse vectors. These last works, however, lose a little in the dimension, obtaining $m=O(\epsilon^{-2} \log N (\log\log N)^3)$. \cite{npw14} maintained the same running time while improving the dimension to $m=O(\epsilon^{-2} \log N (\log\log N)^2)$.
\section{Other Johnson-Lindenstrauss Work}
So far we have always looked at using linear maps, but can we do better with other maps?
\begin{definition}
For the metric space $(T,\ell_2)$, define the \emph{doubling constant} $\lambda(T)$ as the smallest number $\lambda$ such that any $\ell_2$ ball of radius $r$ can be covered by at most $\lambda$ balls of radius $r/2$ centered in $T$.
\end{definition}
\begin{definition}
The \emph{doubling dimension} of $T$ is $d(T)=\log_2(\lambda(T))$.
\end{definition}
\begin{lemma}
Any embedding of $T$ into $\ell_2^m$ with distortion at most $C$ must have $m \gtrsim d(T)$.
\end{lemma}
\begin{proof}
(Sketch) Take a ball in the original space and map it into $\ell_2^m$ using the embedding. Cover this embedded ball with a few $(r/f(k))$-balls in $\ell_2^m$, then take the inverse images of these balls in the original space; they cover the original ball.
\end{proof}
\begin{conjecture}
\emph{(\cite{gkl03}, \cite{lp01})} Any $T$ can be embedded into $\ell_2^m$ with distortion $C$ such that $$C \leq f(d(T)) \quad\text{and}\quad m \leq f'(d(T))$$ for some functions $f$ and $f'$.
\end{conjecture}
This conjecture cannot be realized with linear maps. Here is an example pointed out to me by Huy Nguy$\tilde{\hat{\mbox{e}}}$n (though the existence of similar examples was known even at the time of \cite{gkl03}). Take the set
$$T=\left\{
\begin{array}{l}
x_{ij0}=A(in+j)e_{n+1}+e_i \\
x_{ij1}=A(in+j)e_{n+1}+e_j \\
\end{array}\right.$$
for some large $A$, where $i,j$ range over $[n]$. This set looks like a long line along $e_{n+1}$ with periodic tight clusters of points deviating very slightly from the line. A linear map for this set requires $\Omega(\log N)$ dimensions (where $N=|T|$), because any linear map $\Pi$ must satisfy $\|\Pi x_{ij0}-\Pi x_{ij1}\|=\|\Pi e_i - \Pi e_j\|$, for which \cite{a03} gives a lower bound of $\Omega(\log N)$.
However, if we use the nonlinear mapping $$f(x_{ij0}) = (A(in+j),0,1)$$ $$f(x_{ij1}) = (A(in+j),1,0)$$ then we can achieve constant dimension and distortion.
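A quick numerical check of this claim (an illustrative Python sketch; the instance sizes $n=4$ and $A=1000$ are arbitrary choices of mine, and pairs with $i=j$ are skipped since $x_{ii0}=x_{ii1}$):

```python
from itertools import combinations
from math import sqrt

n, A = 4, 1000.0   # small instance; A large so the clusters are far apart

def dist(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

points, images = [], []
for i in range(n):
    for j in range(n):
        if i == j:
            continue                  # x_{ii0} and x_{ii1} coincide; skip them
        c = A * (i * n + j)
        for b, e in ((0, i), (1, j)):
            p = [0.0] * (n + 1)
            p[n] = c                  # position along e_{n+1}
            p[e] = 1.0                # the slight deviation e_i or e_j
            points.append(p)
            images.append((c, float(b), float(1 - b)))   # f(x_{ijb})

ratios = [dist(fu, fv) / dist(u, v)
          for (u, fu), (v, fv) in combinations(list(zip(points, images)), 2)]
distortion = max(ratios) / min(ratios)   # close to 1
```

The distortion comes out within a $1+O(1/A^2)$ factor of 1, in a fixed 3 dimensions, independent of the number of points.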
\bibliographystyle{alpha}
\begin{thebibliography}{NPW12}
\bibitem[AC06]{ac06}
Nir Ailon and Bernard Chazelle.
\newblock Approximate nearest neighbors and the fast Johnson-Lindenstrauss
transform.
\newblock In {\em Proceedings of the 38th annual ACM symposium on
Theory of computing (STOC)}, pages 557--563. ACM, 2006.
\bibitem[AL09]{al09}
Nir Ailon and Edo Liberty.
\newblock Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes.
\newblock {\em Discrete \& Computational Geometry}, 42(4): 615--630, 2009.
\bibitem[AL13]{al11}
Nir Ailon and Edo Liberty.
\newblock An almost optimal unrestricted fast Johnson-Lindenstrauss transform.
\newblock {\em ACM Transactions on Algorithms (TALG)}, 9(3):21, 2013.
\bibitem[Alo03]{a03}
Noga Alon.
\newblock Problems and results in extremal combinatorics--I.
\newblock {\em Discrete Mathematics}, 273(1):31--53, 2003.
\bibitem[GKL03]{gkl03}
Anupam Gupta, Robert Krauthgamer, and James~R Lee.
\newblock Bounded geometries, fractals, and low-distortion embeddings.
\newblock In {\em Proceedings of the 44th Symposium on Foundations of Computer Science (FOCS)}, pages 534--543. IEEE, 2003.
\bibitem[KW11]{kw11}
Felix Krahmer and Rachel Ward.
\newblock New and improved Johnson-Lindenstrauss embeddings via the restricted
isometry property.
\newblock {\em SIAM Journal on Mathematical Analysis}, 43(3):1269--1281, 2011.
\bibitem[LP01]{lp01}
Urs Lang and Conrad Plaut.
\newblock Bilipschitz embeddings of metric spaces into space forms.
\newblock {\em Geometriae Dedicata}, 87(1-3):285--307, 2001.
\bibitem[NPW14]{npw14}
Jelani Nelson, Eric Price, and Mary Wootters.
\newblock New constructions of rip matrices with fast multiplication and fewer
rows.
\newblock In {\em Proceedings of the 35th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)}, 2014.
\end{thebibliography}
\end{document}