CS 229r: Mathematical Approaches to Data Privacy
Prof. Salil Vadhan
This course is about the following question:
How can we enable the analysis of datasets with sensitive information about individuals while protecting the privacy of those individuals?
This question is motivated by the vast amounts of data about individuals that are being collected by companies, researchers, and the government (e.g. census data, genomic databases, web-search logs, GPS readings, social network activity). The sharing and analysis of such data can be extremely useful, enabling researchers to better understand human health and behavior, companies to better serve their customers, and governments to be more accountable to their citizens. However, a major challenge is that these datasets contain lots of sensitive information about individuals, which the data-holders are often ethically or legally obligated to protect. The traditional approach to protecting privacy when sharing data is to remove "personally identifying information,'' but it is now known that this approach does not work, because seemingly innocuous information is often sufficient to uniquely identify individuals. Indeed, there have been many high-profile examples in which individuals in supposedly anonymized datasets were re-identified by linking the remaining fields with other, publicly available datasets.
Over the past decade, a new line of work in theoretical computer science—differential privacy—has provided a framework for computing on sensitive datasets in which one can mathematically prove that individual-specific information does not leak. In addition to showing that many useful data analysis tasks can be accomplished while satisfying the strong privacy requirement of differential privacy, this line of work has also shown that differential privacy is quite rich theoretically, with connections to many other areas of theoretical computer science and mathematics (learning theory, statistics, cryptography, computational complexity, convex geometry, mechanism design,...) At the same time, differential privacy has attracted the attention of many communities outside theoretical computer science, such as databases, programming languages, computer security, statistics, and law and policy, and has potential for a significant impact on practice.
Our focus on this course will be on the mathematical theory of differential privacy and its connections to other areas. We may also touch on efforts to bring differential privacy to practice, and alternative approaches to data privacy outside the scope of differential privacy. There is a new multidisciplinary research effort at Harvard on Privacy Tools for Sharing Research Data, and this course is good preparation for those who want to get involved with the algorithmic aspects of that project.