\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\DeclareMathOperator*{\E}{\mathbb{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbb{P}}
\DeclareMathOperator{\Var}{Var}
\newcommand{\eps}{\varepsilon}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1\right\rVert}
\newcommand{\abs}[1]{\left\lvert #1\right\rvert}
\newcommand{\avg}[1]{\left\langle #1\right\rangle}
\newcommand{\floor}[1]{\left\lfloor #1\right\rfloor}
\newcommand{\ceil}[1]{\left\lceil #1\right\rceil}
\newcommand{\de}[2]{\dfrac{d#1}{d#2}}
\newcommand{\inp}[2]{\left\langle#1,#2\right\rangle}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf CS 229r: Algorithms for Big Data } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\begin{document}
\lecture{13 --- October 15, 2013}{Fall 2013}{Prof.\ Jelani Nelson}{Sam Elder}
\section{Introduction and Overview}
We're starting a new topic: numerical linear algebra. We'll see some related topics, too. We'll spend about four lectures on this before moving on to compressed sensing.
Today, we'll look at approximate matrix multiplication. Next lecture we'll get to (oblivious) subspace embeddings and least squares regression. For all of these, we'll develop randomized approximation algorithms that compute what they're supposed to compute faster than deterministic exact algorithms for the same problems.
Let's start by talking about matrix multiplication. We have matrices $A$, which is $n\times r$, and $B$, which is $n\times p$, and we want to compute $A^TB$. (This awkward formulation will make some later lemmas seem natural.)
The standard algorithm, which is just three nested for loops, takes time $O(rnp)$. You can do better. If $A$ and $B$ are both square, i.e.\ $r=n=p$, then you can do it in time $O(n^\omega)$, where $\omega<2.373$. Strassen \cite{Strassen} was the first to do better than cubic time, obtaining $O(n^{\log_27})$ by some clever recursion. Coppersmith and Winograd \cite{CW87} got a better exponent in the 1980s, and there have been recent improvements by Stothers \cite{Stothers} in 2010 and Williams \cite{Williams} in 2011. No one uses these algorithms in practice, and even Strassen is used only occasionally.
\section{Approximate Matrix Multiplication}
That's square matrices and exact computation, so let's look at approximate matrix multiplication. What do we want?
We want to quickly find matrices $\tilde A$, which is $m\times r$, and $\tilde B$, which is $m\times p$, where $m\ll n$, such that $\norm{\tilde A^T\tilde B-A^TB}$ is small. In this lecture, the norm will be the Frobenius norm, and ``small'' will mean $\epsilon\norm A_F\norm B_F$. Recall that $\norm X_F=(\sum_{i,j}X_{ij}^2)^{1/2}$ (treat the matrix as a vector and take its $\ell_2$ norm).
\subsection{Sampling Algorithm}
We'll look at two algorithms for approximate matrix multiplication. The first one will use sampling and is due to Drineas, Kannan, and Mahoney \cite{DKM}. Here's the approach: notice that if $A$ consists of rows $x_1^T,\dotsc,x_n^T$ and $B$ consists of rows $y_1^T,\dotsc,y_n^T$ (so each $x_k$ and $y_k$ is a column vector), then $A^TB=\sum_{k=1}^nx_ky_k^T$. Indeed,
\[(A^TB)_{ij}=\sum_kA_{ki}B_{kj}=\sum_{k=1}^n(x_k)_i(y_k)_j=(\sum_{k=1}^nx_ky_k^T)_{ij}.\]
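This outer-product identity is easy to sanity-check numerically; here is a small Python/numpy sketch (the variable names and dimensions are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 6, 4, 5
A = rng.standard_normal((n, r))  # rows x_1^T, ..., x_n^T
B = rng.standard_normal((n, p))  # rows y_1^T, ..., y_n^T

# A^T B equals the sum of the n rank-one outer products x_k y_k^T.
outer_sum = sum(np.outer(A[k], B[k]) for k in range(n))
assert np.allclose(outer_sum, A.T @ B)
```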
The idea is that we have these $n$ outer products, and we'll sample only $m\ll n$ of them. We won't use the uniform distribution, though: We'll choose $\displaystyle\tilde A^T\tilde B=\frac1m\sum_{t=1}^m\frac{x_{k_t}y_{k_t}^T}{p_{k_t}}$, where $p_k$ is the probability that we sample the outer product $x_ky_k^T$.
In terms of matrices, $\tilde A=\Pi A$ and $\tilde B=\Pi B$, where $\Pi$ has exactly one nonzero entry in each row, chosen according to some distribution: if row $t$ picks index $k$, which happens with probability $p_k$, we put a $1/\sqrt{mp_k}$ entry in column $k$ of that row. The probability distribution isn't uniform; we'll take $p_k\propto\norm{x_k}_2\norm{y_k}_2$. Notice that if you were to implement this as a streaming algorithm, you'd need two passes over the data: the first to determine the $p_k$, and the second to sample with those probabilities. Perhaps it could be done in one pass with some variant of reservoir sampling, though.
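Putting the pieces together, a minimal sketch of the sampling estimator might look as follows (a hedged illustration of the approach, not the authors' code; the function name and parameter choices are ours):

```python
import numpy as np

def sample_amm(A, B, m, rng):
    """Approximate A^T B by sampling m outer products x_k y_k^T with
    probability p_k proportional to ||x_k|| * ||y_k||, rescaling each
    sampled product by 1/(m p_k) to keep the estimator unbiased."""
    n = A.shape[0]
    w = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = w / w.sum()                  # p_k proportional to ||x_k|| ||y_k||
    ks = rng.choice(n, size=m, p=p)  # sampled indices k_1, ..., k_m
    # Equivalent to applying Pi with a single 1/sqrt(m p_k) entry per row.
    return sum(np.outer(A[k], B[k]) / (m * p[k]) for k in ks)

rng = np.random.default_rng(1)
A = rng.standard_normal((50000, 8))
B = rng.standard_normal((50000, 6))
est = sample_amm(A, B, m=2000, rng=rng)
err = np.linalg.norm(est - A.T @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
# The variance bound below gives E[err^2] <= 1/m, so err is typically small.
```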
The claim is that this is an unbiased estimator and that its variance is small. We can't boost the success probability by simply running a bunch of trials and taking the median, since these are matrices, but there is a way to boost it anyway.
So we have $\tilde A^T\tilde B=\frac1m\sum_{t=1}^m\frac{x_{k_t}y_{k_t}^T}{p_{k_t}}$. Call the summand (including the $\frac1m$ factor) $Z_t$. We'll look at $\E Z_t$, and that'll tell us about the expectation of the sum by linearity of expectation. We have
\[\E Z_t=\frac1m\sum_{k=1}^n\frac{\Pr(k_t=k)x_ky_k^T}{p_k}=\frac1mA^TB,\]
so the expectation is correct. Now the variance calculation. We have
\begin{align*}
\E\norm{\tilde A^T\tilde B-A^TB}_F^2&=\sum_{i=1}^r\sum_{j=1}^p\E((\tilde A^T\tilde B)_{ij}-(A^TB)_{ij})^2\\
&=\sum_{i,j}\Var\left(\sum_{t=1}^m(Z_t)_{ij}\right)=\sum_{i,j}m\Var((Z_t)_{ij})
\end{align*}
since the $Z_t$ are independent and identically distributed. And we have
\allowdisplaybreaks
\begin{align*}
\Var((Z_t)_{ij})&\le\E(Z_t)_{ij}^2=\frac1{m^2}\sum_{k=1}^n\frac{p_k(x_k)_i^2(y_k)_j^2}{p_k^2}\\
\E\norm{\tilde A^T\tilde B-A^TB}_F^2&\le\frac1m\sum_{i,j}\sum_{k=1}^n\frac{(x_k)_i^2(y_k)_j^2}{p_k}\\
&=\frac1m\sum_{k=1}^n\frac1{p_k}\norm{x_k}^2\norm{y_k}^2\\
&=\frac1m\left(\sum_{k=1}^n\norm{x_k}\norm{y_k}\right)^2\text{ (substituting the definition of $p_k$)}\\
&\le\frac1m\left(\sum_{k=1}^n\norm{x_k}^2\right)\left(\sum_{k=1}^n\norm{y_k}^2\right)\text{ (Cauchy-Schwarz)}\\
&=\frac1m\norm A_F^2\norm B_F^2.
\end{align*}
Now when we apply Chebyshev, we get
\[\Pr_\Pi\left(\norm{\tilde A^T\tilde B-A^TB}_F>\epsilon\norm A_F\norm B_F\right)<\frac{\E\norm{\tilde A^T\tilde B-A^TB}_F^2}{\epsilon^2\norm A_F^2\norm B_F^2}<\frac1{\epsilon^2m}.\]
In conclusion, we only need to sample $m=O(1/\epsilon^2)$ outer products to get success with probability $9/10$.
So we said we could boost the failure probability down to $\delta$ efficiently. We could simply take $m=O(1/(\epsilon^2\delta))$, but we can also do it more efficiently, paying only a $\log(1/\delta)$ multiplier.
We'll see a trick due to Clarkson and Woodruff \cite{CWSTOC} from STOC '09 (quite recent!). Pick independent $\Pi_1,\dotsc,\Pi_t$, where $t=O(\log(1/\delta))$, and form the $t$ matrix products $\tilde A_1^T\tilde B_1,\dotsc,\tilde A_t^T\tilde B_t$. In the past, we had scalar estimates and took their median, but these are matrices. We'd like to compute each $\norm{\tilde A_j^T\tilde B_j-A^TB}_F$ and check that it's small, but that would require computing $A^TB$, which is exactly what we're trying to avoid. So instead, we'll compare the estimates with each other: Pick the first $j$ you find such that
\[\norm{\tilde A_j^T\tilde B_j-\tilde A_i^T\tilde B_i}_F<\frac\epsilon2\norm A_F\norm B_F\]
for more than half of the $i$'s. This is something like choosing a median.
Why does this work? Well, with probability $1-\delta$, more than half of the $j$'s have $\norm{\tilde A_j^T\tilde B_j-A^TB}_F<\frac\epsilon4\norm A_F\norm B_F$, by a Chernoff bound (each independent trial achieves accuracy $\epsilon/4$ with constant probability, taking $m=O(1/\epsilon^2)$ as above). Then we just use the triangle inequality to see that for all such $i,j$,
\[\norm{\tilde A_j^T\tilde B_j-\tilde A_i^T\tilde B_i}_F\le\norm{\tilde A_j^T\tilde B_j-A^TB}_F+\norm{\tilde A_i^T\tilde B_i-A^TB}_F\le\frac\epsilon2\norm A_F\norm B_F,\]
so any such $j$ will be counted. Could another one trick us? Nope: If $\tilde A_j^T\tilde B_j$ is close to more than half of the $\tilde A_i^T\tilde B_i$, then since more than half of the $i$'s are good, at least one $i$ is both good and within $\frac\epsilon2\norm A_F\norm B_F$ of $\tilde A_j^T\tilde B_j$. The triangle inequality then implies that any selected $j$ is within $\frac{3\epsilon}4\norm A_F\norm B_F$ of $A^TB$, as desired.
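This selection rule is easy to sketch in code (a toy illustration with our own naming; `threshold` plays the role of $\frac\epsilon2\norm A_F\norm B_F$):

```python
import numpy as np

def select_estimate(estimates, threshold):
    """Return the first estimate within `threshold` (Frobenius distance)
    of more than half of the candidates -- the matrix analogue of a median."""
    t = len(estimates)
    for j, Mj in enumerate(estimates):
        close = sum(np.linalg.norm(Mj - Mi) < threshold
                    for i, Mi in enumerate(estimates) if i != j)
        if close > t / 2:
            return Mj
    return None  # with probability at least 1 - delta we never get here

# Toy demo: five near-copies of a target product plus one bad outlier.
target = np.eye(3)
candidates = [target + 0.01 * np.eye(3) for _ in range(5)]
candidates.append(target + 10.0 * np.eye(3))  # the outlier
chosen = select_estimate(candidates, threshold=0.5)
# The outlier is close to no one, so a near-copy of `target` is chosen.
```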
\subsection{Dimensionality Reduction-based Algorithm}
Now let's see another way to do approximate matrix multiplication which is related to something we've seen previously in this class: the Johnson-Lindenstrauss (dimensionality reduction) lemma. Using JL for approximate matrix multiplication was first explored by Sarl\'{o}s \cite{Sarlos06}, but we'll present the definitions and analysis from \cite{KN12} since it obtains sharper results by a logarithmic factor (Clarkson and Woodruff \cite{CWSTOC} also improved the logarithmic factor in the special case where the JL matrix of interest is a scaled random sign matrix).
\begin{definition}A distribution $D$ over $\R^{m\times n}$ is said to have the \emph{$(\epsilon,\delta,p)$-JL moment property} if for every $x\in\R^n$ with $\norm x=1$,
\[\E_{\Pi\sim D}\abs{\norm{\Pi x}^2-1}^p<\epsilon^p\delta.\]\end{definition}
Note that by Markov, this gives us
\[\Pr_\Pi\left(\abs{\norm{\Pi x}^2-1}>\epsilon\right)<\frac1{\epsilon^p}\E_\Pi\abs{\norm{\Pi x}^2-1}^p<\delta.\]
Having a moment bound looks a little stronger than just having a tail bound, but often it isn't. Sometimes you prove a tail bound looking like
\[\forall\epsilon > 0,\ \Pr_{\Pi\sim D}(\abs{\norm{\Pi x}^2-1}>\epsilon)<\exp(-c\min\{\epsilon^2,\epsilon\}m).\]
However, this statement is actually equivalent to $D$ having the $(\epsilon,\exp(-c\min\{\epsilon^2,\epsilon\}m),\min\{\epsilon,\epsilon^2\}m)$-JL moment property for all $\epsilon>0$. One direction is just Markov as above, but for the other direction, you can use integration by parts. Let $Z$ be a nonnegative random variable with pdf $\varphi$ and cdf $\Phi$. Then
\[\E Z^p=\int_0^\infty \epsilon^p\varphi(\epsilon)\,d\epsilon=\left[-\epsilon^p(1-\Phi(\epsilon))\right]_0^\infty+p\int_0^\infty \epsilon^{p-1}(1-\Phi(\epsilon))\,d\epsilon=p\int_0^\infty \epsilon^{p-1}(1-\Phi(\epsilon))\,d\epsilon.\]
Since $1-\Phi(\epsilon)$ is just $\Pr(Z > \epsilon)$, inserting the tail bound above gives a moment bound. Notice that for the moment bound, you need a tail bound for every $\epsilon$, because you need to integrate. We won't focus on this too much, but just remember that tail bounds for every $\epsilon$ are equivalent to moment bounds (of this form) for every $\epsilon$.
\begin{theorem}Suppose that $D$ satisfies the $(\epsilon,\delta,p)$-JL moment property for some $p\ge2$. Then for every $A,B$ with matching numbers of rows,
\[\Pr_{\Pi\sim D}\left(\norm{(\Pi A)^T(\Pi B)-A^TB}_F>3\epsilon\norm A_F\norm B_F\right)<\delta.\]
\end{theorem}
\begin{proof}The idea is that the JLMP implies that you preserve vector norms; we'll show that this implies you preserve dot products, which in turn implies you preserve matrix products. Arguing in terms of moments instead of tail bounds lets us exploit Minkowski's inequality (namely that $\|\cdot\|_p$ satisfies the triangle inequality). Arguing in terms of tail bounds tempts you to use the union bound to say all entries of $(\Pi A)^T(\Pi B) - A^TB$ are preserved simultaneously, but this leads to worse bounds.
So suppose $a,b\in\R^n$ with $\norm a=\norm b=1$. Then $\norm{a-b}^2=\norm a^2+\norm b^2-2\inp ab$ and $\norm{\Pi a-\Pi b}^2=\norm{\Pi a}^2+\norm{\Pi b}^2-2\inp{\Pi a}{\Pi b}$. Therefore, under distribution $D$,
\begin{align*}
\norm{\inp{\Pi a}{\Pi b}-\inp ab}_p&=\frac12\norm{(\norm{\Pi a}^2-1)+(\norm{\Pi b}^2-1)+(\norm{a-b}^2-\norm{\Pi a-\Pi b}^2)}_p\\
&\le\frac12\left[\norm{\norm{\Pi a}^2-1}_p+\norm{\norm{\Pi b}^2-1}_p+\norm{a-b}^2\norm{\norm{\Pi\left(\frac{a-b}{\norm{a-b}}\right)}^2-1}_p\right]\\
&<\frac12[\epsilon\delta^{1/p}+\epsilon\delta^{1/p}+4\epsilon\delta^{1/p}]=3\epsilon\delta^{1/p}.
\end{align*}
Now, let's look at $A^TB$. Again write $A$ as having rows $x_1^T,\dotsc,x_n^T$ and $B$ having rows $y_1^T,\dotsc,y_n^T$. Define $X_{ij}=\frac1{\norm{x_i}\norm{y_j}}(\inp{\Pi x_i}{\Pi y_j}-\inp{x_i}{y_j})$. So we can write
\[\norm{(\Pi A)^T(\Pi B)-A^TB}_F^2=\sum_{i=1}^r\sum_{j=1}^p\norm{x_i}^2\norm{y_j}^2X_{ij}^2.\]
Since $p\ge2$, $\norm\cdot_{p/2}$ is still a norm, so we apply the triangle inequality to get
\begin{align*}
\norm{\norm{(\Pi A)^T(\Pi B)-A^TB}_F^2}_{p/2}&=\norm{\sum_{i,j}\norm{x_i}^2\norm{y_j}^2X_{ij}^2}_{p/2}\\
&\le\sum_{i,j}\norm{x_i}^2\norm{y_j}^2\norm{X_{ij}^2}_{p/2}=\sum_{i,j}\norm{x_i}^2\norm{y_j}^2\norm{X_{ij}}_p^2\\
&<(3\epsilon\delta^{1/p})^2\sum_{i,j}\norm{x_i}^2\norm{y_j}^2=(3\epsilon\delta^{1/p}\norm A_F\norm B_F)^2.
\end{align*}
Also note
\[\E \norm{(\Pi A)^T(\Pi B)-A^TB}_F^p = \norm{\norm{(\Pi A)^T(\Pi B)-A^TB}_F^2}^{p/2}_{p/2} < (3\epsilon\delta^{1/p}\norm A_F\norm B_F)^p.\]
Now we just apply Markov to get a tail bound.
\[\Pr\left(\norm{(\Pi A)^T(\Pi B)-A^TB}_F>3\epsilon\norm A_F\norm B_F\right)<\frac{\E\norm{(\Pi A)^T(\Pi B)-A^TB}_F^p}{(3\epsilon\norm A_F\norm B_F)^p}<\delta.\qedhere\]\end{proof}
So we've proved that having an $(\epsilon,\delta,p)$-JLMP distribution is good enough. How do we pick such a distribution? Any JL distribution works. For example, we can pick the $\Pi$ from PSet 1, Problem 3 (which is due to Thorup and Zhang \cite{TZ12}), with a random sign in each column and $m=O(1/\epsilon^2)$ rows, to get success probability $2/3$; this is an $(\epsilon,1/3,2)$-JLMP. This is nice because it yields a $1$-pass algorithm, and we've reduced the problem to something we already have some ideas for.
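As a concrete illustration, here is a sketch using a dense scaled random sign matrix for $\Pi$ (the special case analyzed by Clarkson and Woodruff; this is not the sparse Thorup--Zhang $\Pi$ from the problem set, and the sizes below are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4000, 1000
A = rng.standard_normal((n, 8))
B = rng.standard_normal((n, 6))

# Pi has i.i.d. +-1/sqrt(m) entries, so E ||Pi x||^2 = ||x||^2 for every x.
Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
est = (Pi @ A).T @ (Pi @ B)
err = np.linalg.norm(est - A.T @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
# err should be on the order of 1/sqrt(m), i.e. a few percent here.
```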
\section{Further Numerical Linear Algebra}
Next time, we'll look at subspace embeddings.
\begin{definition}If $V\subseteq\R^n$ is a dimension-$d$ linear subspace, we say that $\Pi$ is an \emph{$\epsilon$-subspace embedding for $V$} if for every $x\in V$, $\norm{\Pi x}=(1\pm\epsilon)\norm x$.\end{definition}
We'll see how subspace embeddings relate to a lot of things we've seen. Problem 1 on the current PSet (due Thursday) shows that for any such $V$ there is a set of $N = \exp(cd)$ vectors such that if $\Pi$ satisfies the JL lemma conditions on those $N$ vectors then $\Pi$ is a subspace embedding for $V$.
\begin{thebibliography}{42}
\bibitem{CWSTOC}
Kenneth~Clarkson, David~Woodruff.
\newblock Numerical Linear Algebra in the Streaming Model.
\newblock \emph{STOC '09}, 205--214, 2009.
\bibitem{CW87}
Don~Coppersmith, Shmuel~Winograd.
\newblock Matrix Multiplication via Arithmetic Progressions.
\newblock \emph{STOC '87}, 1--6, 1987.
\bibitem{DKM}
Petros~Drineas, Ravi~Kannan, Michael~Mahoney.
\newblock Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication.
\newblock \emph{SIAM J. Comput.}, 36:132--157, 2006.
\bibitem{KN12}
Daniel~M.~Kane, Jelani~Nelson.
\newblock Sparser Johnson-Lindenstrauss transforms.
\newblock \emph{SODA}, 1195--1206, 2012.
\bibitem{Sarlos06}
Tam\'{a}s~Sarl\'{o}s.
\newblock Improved Approximation Algorithms for Large Matrices via Random Projections.
\newblock \emph{FOCS}, 143--152, 2006.
\bibitem{Stothers}
Andrew~James~Stothers.
\newblock On the Complexity of Matrix Multiplication.
\newblock \emph{University of Edinburgh PhD Thesis}, 2010.
\bibitem{Strassen}
Volker~Strassen.
\newblock Gaussian Elimination is not Optimal.
\newblock \emph{Numer. Math.}, 13:354--356, 1969.
\bibitem{TZ12}
Mikkel~Thorup, Yin~Zhang.
\newblock Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation.
\newblock \emph{SIAM J. Comput.}, 41(2): 293--331, 2012.
\bibitem{Williams}
Virginia~Vassilevska~Williams.
\newblock Multiplying Matrices Faster than Coppersmith-Winograd.
\newblock \emph{STOC '12}, 887--898, 2012.
\end{thebibliography}
\end{document}