CS 229r: Mathematical Approaches to Data Privacy
Spring 2013

SYLLABUS

Time & Place: TuTh 10-11:30, Maxwell Dworkin 221

Instructors:
Salil Vadhan
Office: Maxwell Dworkin 337
Regular Office hours: Mon 1:30-2:30, Thu 11:30-12:30, in MD 337

Shopping Week Office Hours: Fri 1/25 2-3:30, Mon 1/28 11-12, Mon 1/28 1:30-2:30, Tue 1/29 11:30-12:30, Tue 1/29 2-3, Thu 1/31 11:30-12:30, Thu 1/31 2-3, Fri 2/1 10:15-11:15, Fri 2/1 1:30-3. Sign up on door or by emailing Carol Harlow (harlow@seas).

Jonathan Ullman
Office: Maxwell Dworkin 138
Office hours: Wed 1-3 in MD 138 (tentatively)

E-mail address for questions: diffprivcourse@seas.harvard.edu
E-mail address for submitting homeworks: diffprivcourse-hw@seas.harvard.edu
Course website: http://www.courses.fas.harvard.edu/colgsas/3730

Subject Matter

This course is about the following question:

How can we enable the analysis of datasets with sensitive information about individuals while protecting the privacy of those individuals?

This question is motivated by the vast amounts of data about individuals that are being collected by companies, researchers, and the government (e.g. census data, genomic databases, web-search logs, GPS readings, social network activity). The sharing and analysis of such data can be extremely useful, enabling researchers to better understand human health and behavior, companies to better serve their customers, and governments to be more accountable to their citizens. However, a major challenge is that these datasets contain lots of sensitive information about individuals, which the data-holders are often ethically or legally obligated to protect. The traditional approach to protecting privacy when sharing data is to remove "personally identifying information," but it is now known that this approach does not work, because seemingly innocuous information is often sufficient to uniquely identify individuals. Indeed, there have been many high-profile examples in which individuals in supposedly anonymized datasets were re-identified by linking the remaining fields with other, publicly available datasets.

Over the past decade, a new line of work in theoretical computer science, known as differential privacy, has provided a framework for computing on sensitive datasets in which one can mathematically prove that individual-specific information does not leak. In addition to showing that many useful data analysis tasks can be accomplished while satisfying the strong privacy requirement of differential privacy, this line of work has also shown that differential privacy is quite rich theoretically, with connections to many other areas of theoretical computer science and mathematics (learning theory, statistics, cryptography, computational complexity, convex geometry, mechanism design, ...). At the same time, differential privacy has attracted the attention of many communities outside theoretical computer science, such as databases, programming languages, computer security, statistics, and law and policy, and has potential for a significant impact on practice.
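
For orientation, the guarantee that this line of work proves can be written down in one line. What follows is the usual textbook formulation rather than a statement taken from this syllabus, and the symbols (M for a randomized mechanism, x and x' for neighboring datasets, S for a set of outputs, and the privacy parameter epsilon) are notational choices made here:

```latex
A randomized mechanism $M$ is $\varepsilon$-differentially private if, for every
pair of datasets $x, x'$ that differ in the data of a single individual, and for
every set $S$ of possible outputs,
\[
  \Pr[M(x) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(x') \in S].
\]
```

Smaller values of epsilon mean that the output distribution depends less on any one individual's data.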

Our focus in this course will be on the mathematical theory of differential privacy and its connections to other areas. We may also touch on efforts to bring differential privacy to practice, and alternative approaches to data privacy outside the scope of differential privacy. There is a new multidisciplinary research effort at Harvard on Privacy Tools for Sharing Research Data, and this course is good preparation for those who want to get involved with the algorithmic aspects of that project.

Definite Topics

The definition of differential privacy
- Motivation & interpretation
- Linkage attacks and auxiliary information
- Equivalent formulations (Bayesian, simulation-based)
- What differential privacy does and does not promise
Basic differentially private mechanisms
- Randomized response
- The Laplace mechanism
- Composition theorems
- The exponential mechanism
Answering many queries with differential privacy
- Synthetic data
- Private multiplicative weights
Attacks and lower bounds
- Reconstruction algorithms
- Packing/volume arguments
- Attacks on stateless mechanisms
Computational complexity of differential privacy
- Hardness results for generating synthetic data
- Connection with traitor-tracing schemes
Improved differentially private algorithms
- Faster algorithms for answering many structured queries
- Algorithms that beat worst-case sensitivity
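
To give a concrete feel for two of the basic mechanisms listed above, here is a minimal Python sketch of randomized response and of the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not course material: the function names (laplace_noise, laplace_count, randomized_response) and the example data are hypothetical, the counting query is assumed to have sensitivity 1, and the floating-point sampling is not hardened the way a real deployment would need.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two exponentials."""
    # expovariate takes a rate, so rate = 1 / scale gives mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_count(rows, predicate, epsilon, rng):
    """Answer 'how many rows satisfy predicate?' with Laplace noise.

    Assumption: changing one row changes the count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def randomized_response(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical dataset: one boolean attribute per individual.
    rows = [{"smoker": i % 4 == 0} for i in range(1000)]
    noisy = laplace_count(rows, lambda r: r["smoker"], epsilon=0.5, rng=rng)
    print("noisy count of smokers:", noisy)
    print("randomized response for bit 1:", randomized_response(1, 1.0, rng))
```

Both functions take an explicit random.Random instance so that experiments can be seeded and repeated; a real implementation would draw from a cryptographically secure source instead.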

Potential Topics

(A subset will be selected, depending on time and student interest)

Multiparty differential privacy
- Computational differential privacy
- Secure multiparty computation
- Lower bounds for information-theoretic differential privacy
- Connection to communication complexity
Differential privacy & learning theory
- Privacy of learning tasks
- Boosting and differential privacy
- Query release vs. agnostic learning
Privacy for streaming algorithms
- Pan-private streaming algorithms
- Privacy under continual observation
Privacy for graph analysis
- Attacks on anonymized social networks
- Edge-level privacy vs. node-level privacy
- Releasing graph cuts
- Restricted sensitivity
Differential privacy & mechanism design
- Mechanism design via differential privacy
- Modelling privacy in mechanism design
- Private equilibrium computation
- Auctioning private data
Differential privacy & statistics
- Privacy-preserving statistical estimation, SVMs, and regression
- Robust statistics
Alternative privacy definitions
- Zero-knowledge privacy
- Crowd-blending privacy
- Natural differential privacy
- Concentrated differential privacy
- k-anonymity and variants
- Query auditing
- Privacy for the analyst
Differential privacy in practice
- Programming frameworks: Fuzz, PinQ, Airavat
- Practical implementations of differentially private algorithms
Workload optimal algorithms
- Upper and lower bounds via volume/packing arguments
- Upper and lower bounds via discrepancy
Privacy problems outside the scope of differential privacy
- Behavioral tracking and targeted advertising
- Discrimination
- Surveillance

Format and Goals

The main components of the course are as follows:

Reading and commenting: For every class meeting, we will assign reading (typically from either the Dwork-Roth monograph or from research papers) for you to do in advance. You will be expected to read and comment on this material by midnight before lecture, using the online forum NB.
Class participation: Our class meetings will be very interactive, with you collectively bringing out the key concepts, ideas, and intuition, as well as working through the difficult technical material together (with our guidance, of course). This will demand more of you than a standard lecture-based course, but the hope is that you will come away with a much deeper understanding of the material.
Problem Sets: There will be three problem sets during the course. Problem sets will be handed out on Thursdays (2/7, 2/21, and 3/14) and will be due roughly two weeks later on Fridays (2/22, 3/8, and 4/5), no later than 5pm (unless you use late days).
Final Project: You will be expected to do a final project, on a topic of your choosing. Projects can be done individually or in pairs, with groups of three allowed for ambitious projects. You can do a project that is theoretical, is experimental, or involves system-building, and the project should provide good opportunities to connect the course material to your other interests and get some exposure to doing original research in differential privacy. The project will involve submitting topic ideas for feedback (due Wed, 3/13), a detailed project proposal (due Fri, 4/12), a written paper (due 5/3), and a project presentation (during reading period). We will post more details about the final project, including some directions to look for topics, early in the course.

By the end of the course, we hope that you will all be able to:

Understand the state of the art in differential privacy at a level sufficient to engage in research, apply the material in practice, and/or connect it to other areas,
Extract both the high-level ideas and low-level details when reading a mathematical text and identify interesting questions that are not answered,
Explain and collaboratively work through an advanced subject with your peers,
Formulate and carry out an interesting, short-term independent research project, and present the work in both written and oral form.

Prerequisites

It is important for students to have comfort with:

Rigorous, proof-based mathematics: as in any class in CS theory (e.g. CS 121 or CS 124) or mathematics (e.g. Math 23 or higher).
Reasoning about algorithms: as in CS 121 or CS 124.
Basic discrete probability: as in CS 124 or Stats 110.

During the course, we will touch upon many other topics, which are not covered in the aforementioned courses, and you should be prepared to read up on appropriate background material as needed.

Grading

For grading, we will place approximately equal weight on each of the following three categories:

Commenting on the reading and class participation
Problem sets
Final project

Problem set solutions must be typed and submitted electronically. The deadline will typically be set at 5pm on a Friday. You are allowed 6 late days for the semester, of which at most 4 can be used on any individual problem set. (One late day is exactly 24 hours.)

The problem sets may require a lot of thought, so be sure to start them early. You are encouraged to discuss the course material and the homework problems with each other in small groups (2-3 people). Discussion of homework problems may include brainstorming and verbally walking through possible solutions, but should not include one person telling the others how to solve the problem. In addition, each person must write up their solutions independently, and these write-ups should not be checked against each other or passed around.

Reading

For roughly the first half of the course, we will largely be following a new monograph that is being written by Cynthia Dwork and Aaron Roth. Your comments and corrections on the reading will be made available to them, and will help them in polishing the monograph before it is published. In the second half of the course, many of our readings will be from original research papers.