The data streaming model captures settings in which there is so much data that one
can only store a tiny fraction of it. It also captures settings where one * can * store
the dataset, but cannot afford to look at the full input every time one wants to answer a question about
the data. The goal of a streaming algorithm
is to output a very small summary, or "sketch" of the data, such that one can
still use the summary to (approximately) answer basic questions about the data.
The streaming algorithm will ideally compute the summary in a single pass over the input, with
each datum (i.e., stream update) being processed very quickly.

** Why you should take this course.**
There is the obvious reason that the amount of data in the world
is exploding. Hence, the stringent constraints of the data streaming model
increasingly capture real-world situations.
In addition, this course will teach basic methods and tools for designing and analyzing
randomized algorithms. This toolset and way of thinking will be broadly useful even
outside of streaming environments.

This course is listed as a graduate elective and therefore its primary audience is computer science masters and doctoral students. While there are no formal prerequisites for the course, familiarity with elementary probability theory and data structures is expected. General mathematical maturity, e.g. comfort with proofs, will also be assumed.

There is no course textbook. We will primarily be following Amit Chakrabarti's course notes, and occasionally Andrew McGregor's slides. I will post a reference for each lecture by the start of class.

Grades in the course will be based on three problem sets and a final project. The problem sets will count for 60% of the grade, while the final project will count for 40%.

This course focuses on the design and analysis of streaming algorithms from a theoretical perspective. While I will occasionally discuss implementation issues and practical efficiency in lecture, the problem sets will not require programming.

The final project can be either theory or implementation-based. Example course projects are: (1) Read a recent research paper related to a topic covered in the class and write a comprehensive, readable summary and your opinion of it; (2) Work on a research problem related to the course material and your own research, and write a paper following a conference-paper-like format that describes your findings; (3) Implement one or more algorithms covered in the course or from the research literature, and do a comprehensive performance evaluation of your implementation.

The course project can be done collaboratively in groups of up to 3 people. In fact, collaboration is encouraged. However, the more people in a group, the more substantial the project is expected to be.

After class or by appointment. My office is currently Car Barn 312-B, but I will most likely move to an office in St. Mary's Hall in October.

Below is the **highly tentative** schedule for the course.
I will modify it as the semester progresses.
Topics and readings will not become "official" until the date of lecture.
Hence, I do not recommend reading ahead by more than one lecture, if at all.

Class
Number | Date | Description |
Readings |

Part 1. Insert-Only Streams | |||

1 | 9/1 | Introduction to streaming algorithms; course overview; fingerprinting and the power of randomness | Lecture 0 of Amit's notes. Also read Lee Rhodes' exposition of the benefits of using sketching algorithms in industrial systems. For fingerprinting, see Justin's handwritten notes |

2 | 9/6 | Answering Point Queries and Finding Frequent Items in Insert-Only Streams | Lecture 1 of Amit's notes. See also Justin's handwritten notes |

3 | 9/8 | Tools from probability theory: Markov's Inequality, Chebyshev's Inequality, Chernoff Bounds, Inclusion-Exclusion Principle, Variance Reduction via Averaging. Homework 1 Released
| A handy reference on tools from probability theory is Grigory's slides. See also Justin's handwritten notes |

4 | 9/13 | Sampling Algorithms for Insert-Only Streams: Approximate Medians, Reservoir sampling, Itemset Frequency Estimation. | Andrew's slides. See also Justin's handwritten notes. |

5 | 9/15 | AMS Sampling For Estimating High Frequency Moments. | Lecture 5 of Amit's notes. See also Justin's handwritten notes. |

6 | 9/20 | Estimating Distinct Elements in Insert-Only Streams, Part 1: Simple, supoptimal algorithm. | Lecture 2 of Amit's notes. See also Justin's handwritten notes |

7 | 9/22 | Estimating Distinct Elements in Insert-Only Streams Part 2: Hyperloglog, KMV, Adaptive Sampling | For Adaptive Sampling, see Lecture 3 of Amit's notes. For other content, see Justin's handwritten notes |

8 | 9/27 | Detour: Dictionary Data Structures: Chaining, Linear Probing, Cuckoo Hashing, The Power of Two Choices | Justin's handwritten notes. |

Part 2. Turnstile Streams: Linear Sketches | |||

9 | 9/29 | Answering Point Queries: Count-Min and Count Sketch. Homework 1 due.
| Lecture 4 of Amit's notes See also Justin's handwritten notes. |

10 | 10/4 | Finishing Count Sketch Analysis. Estimating the Second Frequency Moment via the Tug-of-War Sketch. Homework 2 released.
| Lecture 6 of Amit's notes. For the Count Sketch, see also Justin's handwritten notes. |

11 | 10/6 | Linear Sketches, Johnson-Lindenstrauss, Random Projections for Dimensionality Reduction, Estimating Small Frequency Moments via Stable Distributions | Lecture 7 of Amit's notes See also Justin's handwritten notes |

12 | 10/11 | Bloom Filters. Sparse Recovery via Invertible Bloom Lookup Tables. Peeling Algorithms. | For Sparse Recovery algorithms, see these slides. For Bloom Filters, see the excellent Wikipedia article or Michael Mitzenmacher's lecture notes. Justin's handwritten notes may also be useful. |

13 | 10/13 | Applications of Sparse Recovery Algorithms: Set Reconciliation, Biff Codes, L_0 Sampling in Turnstile Streams. Intro to Graph Streams. | For everything except graph streams, see Justin's handwritten notes. For L_0 Sampling, see also Andrew's slides. |

14 | 10/18 | Homework Review | |

15 | 10/20 | Streaming Graph Algorithms: Graph Connectivity and Spanners in the Insert-Only Model. Graph Connectivity in the Turnstile Model via L_0 Sampling. | For an intro to graph streams and connectivity/spanners in the insert-only model, see Andrew's Lecture 2.1 slides. For Graph Connectivity in the turnstile model, see Andrew's Lecture 2.2 slides. See also Justin's handwritten notes. |

Assorted Topics | |||

16 | 10/25 | Lower bounds via communication complexity, Part 1 Homework 2 due. Homework 3 released.
| Lecture 15 in Amit's notes |

17 | 10/27 | Lower bounds via communication complexity, Part 2 | Lecture 16 in Amit's notes |

18 | 11/1 | Lower bounds via communication complexity, Part 3 | Lecture 17 in Amit's notes |

19 | 11/3 | Stream Verification Part 1 | These slides and these slides together cover much of the content. |

20 | 11/8 | Problem Set Review | No reading. |

21 | 11/10 | Stream Verification Part 2 | See these slides |

22 | 11/15 | Exact Medians Using Multiple Passes in Insert-Only Streams | Lecture 9 in Amit's notes |

23 | 11/17 | Geometric streams, coresets, approximating Minimum Enclosing Ball Homework 3 due
| Lecture 11 in Amit's notes |

24 | 11/22 | CR-Precis. Mergeable Summaries. | For CR-Precis, see Slide 11 of Andrew's Slides. For mergeable summaries, see the original paper by Agarwal et al. |

25 | 11/29 | Clustering | Lecture 12 in Amit's notes and Andrew's Slides |

26 | 12/1 | Problem Set Review (probably) | Reading TBD |

27 | 12/6 | Topic TBD. Possibly random-order streams. | Reading TBD |

- An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems by Arnab Bhattacharyya, Palash Dey, and David P. Woodruff.
- Beating CountSketch for Heavy Hitters in Insertion Streams by Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff.
- Bias-Aware Sketches by Jiecao Chen and Qin Zhang
- Optimal Quantile Approximation in Streams by Zohar Karnin, Kevin Lang, Edo Liberty.

- Parameterized streaming: maximal matching and vertex cover by Rajesh Chitnis, Graham Cormode, MohammadTaghi Hajiaghayi, and Morteza Monemizadeh.
- Dynamic Graphs in the Sliding-Window Model by Michael S. Crouch, Andrew McGregor, and Daniel Stubbs.
- Better Algorithms for Counting Triangles in Data Streams by Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu.
- Single Pass Spectral Sparsification in Dynamic Streams by ichael Kapralov, Yin Tat Lee, Cameron Musco, Christopher Musco, and Aaron Sidford.

- Correlation clustering in data streams by K. J. Ahn, G. Cormode, S. Guha, A. McGregor, and A. Wirth.
- Dimensionality Reduction for k-Means Clustering and Low Rank Approximation Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu.
- Randomized Dimensionality Reduction for k-means Clustering by Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas.

- Streaming Communication Protocols by Lucas Boczkowski, Iordanis Kerenidis, Frédéric Magniez
- Subspace Embeddings for the Polynomial Kernel by Haim Avron, Huy Nguyen, and David P. Woodruff.
- Stochastic Streams: Sample Complexity vs. Space Complexity by Michael Crouch, Andrew McGregor, Gregory Valiant, and David P. Woodruff.
- Strong Fooling Sets for Multi-Player Communication with Applications to Deterministic Estimation of Stream Statistics by Amit Chakrabarti and Sagar Kale.
- Better Streaming Algorithms for the Maximum Coverage Problem by Andrew McGregor and Hoa T. Vu.
- Almost Optimal Streaming Algorithms for Coverage Problems by Mohammadhossein Bateni, Hossein Esfandiari, and Vahab Mirrokni
- Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors by Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, Lin F. Yang.
- The Johnson-Lindenstrauss Transform: An Empirical Study by Suresh Venkatasubramanian and Qiushi Wang
- A Near-Optimal Algorithm for Estimating the Entropy of a Stream by Amit Chakrabarti, Graham Cormode, and Andrew McGregor

- Best-Order Streaming Model by AD Sarma, RJ Lipton, and D. Nanongkai.
- Annotations in Data Streams by Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Justin Thaler.
- Verifiable Stream Computation and Arthur-Merlin Communication by Amit Chakrabarti, Graham Cormode, Andrew McGregor, Justin Thaler, and Suresh Venkatasubramanian.