CS 229r: Mathematical Approaches to Data Privacy
Time & Place: TuTh 10-11:30, Maxwell Dworkin 221
Office: Maxwell Dworkin 337
Regular Office hours: Mon 1:30-2:30, Thu 11:30-12:30, in MD 337
Shopping Week Office Hours: Fri 1/25 2-3:30, Mon 1/28 11-12, Mon 1/28 1:30-2:30, Tue 1/29 11:30-12:30, Tue 1/29 2-3, Thu 1/31 11:30-12:30, Thu 1/31 2-3, Fri 2/1 10:15-11:15, Fri 2/1 1:30-3. Sign up on door or by emailing Carol Harlow (harlow@seas).
Office: Maxwell Dworkin 138
Office hours: Wed 1-3 in MD 138 (tentatively)
E-mail address for questions:
E-mail address for submitting homeworks: email@example.com
Course website: http://www.courses.fas.harvard.edu/colgsas/3730
This course is about the following question:
How can we enable the analysis of datasets with sensitive information about individuals while protecting the privacy of those individuals?
This question is motivated by the vast amounts of data about individuals that are being collected by companies, researchers, and the government (e.g. census data, genomic databases, web-search logs, GPS readings, social network activity). The sharing and analysis of such data can be extremely useful, enabling researchers to better understand human health and behavior, companies to better serve their customers, and governments to be more accountable to their citizens. However, a major challenge is that these datasets contain lots of sensitive information about individuals, which the data-holders are often ethically or legally obligated to protect. The traditional approach to protecting privacy when sharing data is to remove "personally identifying information," but it is now known that this approach does not work, because seemingly innocuous information is often sufficient to uniquely identify individuals. Indeed, there have been many high-profile examples in which individuals in supposedly anonymized datasets were re-identified by linking the remaining fields with other, publicly available datasets.
Over the past decade, a new line of work in theoretical computer science, differential privacy, has provided a framework for computing on sensitive datasets in which one can mathematically prove that individual-specific information does not leak. In addition to showing that many useful data analysis tasks can be accomplished while satisfying the strong privacy requirement of differential privacy, this line of work has also shown that differential privacy is quite rich theoretically, with connections to many other areas of theoretical computer science and mathematics (learning theory, statistics, cryptography, computational complexity, convex geometry, mechanism design, ...). At the same time, differential privacy has attracted the attention of many communities outside theoretical computer science, such as databases, programming languages, computer security, statistics, and law and policy, and has potential for a significant impact on practice.
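For orientation, the standard definition from the literature can be sketched as follows. The notation here (M, x, x', S, and the privacy parameter ε) is assumed for illustration and is not fixed by this syllabus; lectures may use different symbols or relaxed variants of the definition.

```latex
% Pure \varepsilon-differential privacy (standard definition; notation assumed here).
% M: randomized algorithm; x, x': "neighboring" datasets, i.e. differing in
% one individual's data; S: any set of possible outputs;
% \varepsilon \ge 0: the privacy parameter.
\[
  \Pr[M(x) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(x') \in S]
  \quad \text{for all neighboring } x, x' \text{ and all } S .
\]
```

A smaller ε means the output distribution changes less when one person's data changes, which is the sense in which individual-specific information does not leak.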
Our focus in this course will be on the mathematical theory of differential privacy and its connections to other areas. We may also touch on efforts to bring differential privacy to practice, and alternative approaches to data privacy outside the scope of differential privacy. There is a new multidisciplinary research effort at Harvard on Privacy Tools for Sharing Research Data, and this course is good preparation for those who want to get involved with the algorithmic aspects of that project.
The main components of the course are as follows:
By the end of the course, we hope that you will all be able to:
It is important for students to have comfort with:
During the course, we will touch upon many other topics, which are not covered in the aforementioned courses, and you should be prepared to read up on appropriate background material as needed.
For grading, we will place approximately equal weight on each of the following 3 categories:
Problem set solutions must be typed and submitted electronically. The deadline will typically be set at 5pm on a Friday. You are allowed 6 late days for the semester, of which at most 4 can be used on any individual problem set (1 late day = exactly 24 hours).
The problem sets may require a lot of thought, so be sure to start them early. You are encouraged to discuss the course material and the homework problems with each other in small groups (2-3 people). Discussion of homework problems may include brainstorming and verbally walking through possible solutions, but should not include one person telling the others how to solve the problem. In addition, each person must write up their solutions independently, and these write-ups should not be checked against each other or passed around.
For roughly the first half of the course, we will largely be following a new monograph that is being written by Cynthia Dwork and Aaron Roth. Your comments and corrections on the reading will be made available to them, and will help them in polishing the monograph before it is published. In the second half of the course, many of our readings will be from original research papers.