# Jonathan Ullman

Postdoctoral Fellow at
Harvard University CRCS
110 Maxwell Dworkin
33 Oxford St
Cambridge, MA, 02138

e-mail: jullman [at] seas [dot] harvard [dot] edu

I am a postdoctoral fellow at Harvard University in the Center for Research on Computation and Society (CRCS). I am interested in theoretical computer science. More specifically, I work on algorithms for privacy-preserving data analysis and their connections to other areas of computer science such as cryptography, mechanism design, and learning theory.

I completed my Ph.D., also at Harvard University, in 2013. I had the very good fortune to be advised by Salil Vadhan.

## Recent News

• Our paper "Fingerprinting Codes and the Price of Approximate Differential Privacy" was accepted to STOC 2014!
• More to come!

## Research Papers (in reverse chronological order)

**Privately Solving Linear Programs** [Abstract] [arXiv]
Justin Hsu, Aaron Roth, Tim Roughgarden, Jonathan Ullman
To appear in ICALP 2014 (Track A) - International Colloquium on Automata, Languages, and Programming

Abstract: In this paper, we initiate the systematic study of solving linear programs under differential privacy. The first step is simply to define the problem: to this end, we introduce several natural classes of private linear programs that capture different ways sensitive data can be incorporated into a linear program. For each class of linear programs we give an efficient, differentially private solver based on the multiplicative weights framework, or we give an impossibility result.

**An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring** [Abstract] [arXiv]
Mallesh Pai, Aaron Roth, Jonathan Ullman
Manuscript

Abstract: We study infinitely repeated games in settings of imperfect monitoring. We first prove a family of theorems showing that when the signals observed by the players satisfy a condition known as differential privacy, the folk theorem has little bite: for sufficiently small values of the privacy parameters, and a fixed discount factor, any equilibrium of the repeated game involves players playing approximate equilibria of the stage game in every period. Next, we argue that in large games ($n$-player games in which unilateral deviations by single players have only a small impact on the utility of other players), many monitoring settings naturally lead to signals that satisfy differential privacy, with privacy parameters tending to zero as the number of players $n$ grows large. We conclude that in such settings, the set of equilibria of the repeated game collapses to the set of equilibria of the stage game.
**Fingerprinting Codes and the Price of Approximate Differential Privacy** [Abstract] [arXiv]
Mark Bun, Jonathan Ullman, Salil Vadhan
To appear in STOC 2014 - Symposium on Theory of Computing

Abstract: We show new lower bounds on the sample complexity of $(\varepsilon, \delta)$-differentially private algorithms that accurately answer large sets of counting queries. A counting query on a database $D \in (\{0,1\}^d)^n$ has the form "What fraction of the individual records in the database satisfy the property $q$?" We show that in order to answer an arbitrary set $\mathcal{Q}$ of $\gg nd$ counting queries on $D$ to within error $\pm \alpha$ it is necessary that $$n \geq \tilde{\Omega}\left( \frac{\sqrt{d} \log |\mathcal{Q}|}{\alpha^2 \varepsilon} \right).$$ This bound is optimal up to poly-logarithmic factors, as demonstrated by the Private Multiplicative Weights algorithm (Hardt and Rothblum, FOCS '10). In particular, our lower bound is the first to show that the sample complexity required for accuracy and $(\varepsilon, \delta)$-differential privacy is asymptotically larger than what is required merely for accuracy, which is $O(\log |\mathcal{Q}| / \alpha^2)$. In addition, we show that our lower bound holds for the specific case of $k$-way marginal queries (where $|\mathcal{Q}| = 2^k \binom{d}{k}$) when $\alpha$ is a constant. Our results rely on the existence of short fingerprinting codes (Boneh and Shaw, CRYPTO '95; Tardos, STOC '03), which we show are closely connected to the sample complexity of differentially private data release. We also give a new method for combining certain types of sample complexity lower bounds into stronger lower bounds.
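The counting queries defined in the abstract above have a standard single-query baseline: add Laplace noise calibrated to the query's sensitivity, which is $1/n$ for a fraction over $n$ records. A minimal sketch of that baseline (the database encoding and predicate here are illustrative, not from the paper):

```python
import numpy as np

def counting_query(db, predicate):
    """True answer: the fraction of records in db satisfying the predicate."""
    return float(np.mean([predicate(row) for row in db]))

def laplace_answer(db, predicate, epsilon, rng=np.random.default_rng()):
    """(epsilon, 0)-DP answer: changing one record shifts the true fraction
    by at most 1/n, so Laplace noise of scale 1/(n * epsilon) suffices."""
    n = len(db)
    return counting_query(db, predicate) + rng.laplace(scale=1.0 / (n * epsilon))

# Example: fraction of records whose first attribute is 1.
db = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]])
noisy = laplace_answer(db, lambda row: row[0] == 1, epsilon=1.0)
```

The lower bound in the paper concerns the much harder regime of answering many such queries simultaneously, where per-query Laplace noise of this form is far from optimal.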
**Faster Private Release of Marginals on Small Databases** [Abstract] [arXiv]
Karthekeyan Chandrasekaran, Justin Thaler, Jonathan Ullman, Andrew Wan
ITCS 2014 - Innovations in Theoretical Computer Science

Abstract: We study the problem of answering $k$-way marginal queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$\mathrm{poly}\left(n, 2^{o(d)} \right)$$ per query and answers any sequence of $\mathrm{poly}(n)$ many $k$-way marginal queries with error at most $\pm 0.01$ on every query, provided $n \gtrsim d^{0.51}$. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee for databases containing $\mathrm{poly}(d, k)$ records in time $\exp(o(d))$. Our algorithm runs the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10) on a new approximate polynomial representation of the database. We derive our representation for the database by approximating the OR function restricted to low Hamming weight inputs using low-degree polynomials with coefficients of bounded $L_1$-norm. In doing so, we show new upper and lower bounds on the degree of such polynomials, which may be of independent approximation-theoretic interest.
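Several entries above build on the private multiplicative weights algorithm of Hardt and Rothblum. The following is a toy sketch of its core update rule, maintaining a distribution over the full universe $\{0,1\}^d$ (feasible only for tiny $d$); all parameter names are illustrative and the sketch does not track the overall privacy budget across rounds:

```python
import itertools
import numpy as np

def pmw(db, queries, eps_round=0.5, threshold=0.2, eta=0.05,
        rng=np.random.default_rng(0)):
    """Toy private-multiplicative-weights sketch. db is a list of d-bit
    tuples; each query maps a record to {0, 1}. Maintains a distribution
    ("synthetic database") over the universe {0,1}^d."""
    d = len(db[0])
    n = len(db)
    universe = list(itertools.product([0, 1], repeat=d))
    weights = np.ones(len(universe)) / len(universe)   # uniform prior
    answers = []
    for q in queries:
        qvec = np.array([q(u) for u in universe], dtype=float)
        synthetic = float(weights @ qvec)              # estimate from prior
        true_ans = sum(q(tuple(r)) for r in db) / n
        noisy = true_ans + rng.laplace(scale=1.0 / (eps_round * n))
        if abs(noisy - synthetic) <= threshold:
            answers.append(synthetic)                  # "lazy" round: no update
        else:
            sign = 1.0 if noisy > synthetic else -1.0  # "update" round
            weights = weights * np.exp(eta * sign * qvec)
            weights = weights / weights.sum()
            answers.append(float(noisy))
    return answers

# Example on a 4-record, 2-attribute database with two marginal-style queries.
db = [(1, 1), (1, 0), (0, 0), (1, 1)]
answers = pmw(db, [lambda u: u[0], lambda u: u[1]])
```

The work above gets around the exponential cost of this explicit universe by substituting a compact approximate polynomial representation of the database.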
**Differential Privacy for the Analyst via Private Equilibrium Computation** [Abstract] [arXiv]
Justin Hsu, Aaron Roth, Jonathan Ullman
STOC 2013 - Symposium on Theory of Computing

Abstract: We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest.

**Answering $n^{2+o(1)}$ Counting Queries with Differential Privacy is Hard** [Abstract] [arXiv]
Jonathan Ullman
STOC 2013 - Symposium on Theory of Computing
To appear in SICOMP - Special issue for selected papers from STOC 2013

Abstract: A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries are of the form "What fraction of individual records in the database satisfy the property $q$?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database $D \in (\{0,1\}^d)^n$, and $k = \tilde{\Theta}(n^2)$ arbitrary efficiently computable counting queries, runs in time $\mathrm{poly}(d, n)$, and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant-depth ($AC^0$) circuits. Our result is almost tight because it is known that $\tilde{\Omega}(n^2)$ counting queries can be answered efficiently while satisfying differential privacy. Moreover, many more than $n^2$ queries (even exponential in $n$) can be answered in exponential time. We prove our results by extending the connection between differentially private query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the algorithm as input, and by constructing a traitor-tracing scheme that is secure in this setting.

**Mechanism Design in Large Games: Incentives and Privacy** [Abstract] [arXiv]
Michael Kearns, Mallesh Pai, Aaron Roth, Jonathan Ullman
ITCS 2014 - Innovations in Theoretical Computer Science

Abstract: We study the problem of implementing equilibria of complete information games in settings of incomplete information, and address this problem using "recommender mechanisms." A recommender mechanism is one that does not have the power to enforce outcomes or to force participation; rather, it only has the power to suggest outcomes on the basis of voluntary participation. We show that despite these restrictions, recommender mechanisms can implement equilibria of complete information games in settings of incomplete information, provided that the game is large---i.e., that there are a large number of players, and any player's action affects any other's payoff by at most a small amount. Our result follows from a novel application of differential privacy.
We show that any algorithm that computes a correlated equilibrium of a complete information game while satisfying a variant of differential privacy---which we call joint differential privacy---can be used as a recommender mechanism while satisfying our desired incentive properties. Our main technical result is an algorithm for computing a correlated equilibrium of a large game while satisfying joint differential privacy. Although our recommender mechanisms are designed to satisfy game-theoretic properties, our solution ends up satisfying a strong privacy property as well. No group of players can learn "much" about the type of any player outside the group from the recommendations of the mechanism, even if these players collude in an arbitrary way. As such, our algorithm is able to implement equilibria of complete information games without revealing information about the realized types.

**Faster Algorithms for Privately Releasing Marginals** [Abstract] [arXiv]
Justin Thaler, Jonathan Ullman, Salil Vadhan
ICALP 2012 (Track A) - International Colloquium on Automata, Languages, and Programming

Abstract: We study the problem of releasing $k$-way marginals of a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of $D$'s records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS '07). We give an algorithm that runs in time $d^{O(\sqrt{k})}$ and releases a private summary capable of answering any $k$-way marginal query with at most $\pm .01$ error on every query as long as $n \geq d^{O(\sqrt{k})}$. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of $k$-way marginal queries, which is $d^{\Theta(k)}$ (for $k \ll d$).

**Iterative Constructions and Private Data Release** [Abstract] [arXiv]
Anupam Gupta, Aaron Roth, Jonathan Ullman
TCC 2012 - Theory of Cryptography Conference

Abstract: In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction (IDC) algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error $O(|V|^{1.5})$. Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor.
Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.

**On the Zero-Error Capacity Threshold for Deletion Channels** [Abstract] [arXiv]
Ian Kash, Michael Mitzenmacher, Justin Thaler, Jonathan Ullman
ITA 2011 - Information Theory and Applications Workshop

Abstract: We consider the zero-error capacity of deletion channels. Specifically, we consider the setting where we choose a codebook $C$ consisting of strings of $n$ bits, and our model of the channel corresponds to an adversary who may delete up to $pn$ of these bits for a constant $p$. Our goal is to decode correctly without error regardless of the actions of the adversary. We consider what values of $p$ allow non-zero capacity in this setting. We suggest multiple approaches, one of which makes use of the natural connection between this problem and the problem of finding the expected length of the longest common subsequence of two random sequences.

**Privately Releasing Conjunctions and the Statistical Query Barrier** [Abstract] [arXiv]
Anupam Gupta, Moritz Hardt, Aaron Roth, Jonathan Ullman
STOC 2011 - Symposium on Theory of Computing
SICOMP - SIAM Journal on Computing

Abstract: Suppose we would like to know all answers to a set of statistical queries $C$ on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in $C$. In this paper, we investigate how and when we can do better than this naive approach. We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of $C$ in Kearns' statistical-query (SQ) model. This gives a complete answer to the question when running time is not a concern. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to $C$ can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. These results are interesting not only from a learning-theoretic point of view, but also from the perspective of privacy-preserving data analysis. In this context, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average additive error. This presents significant progress on a key open problem in privacy-preserving data analysis. Our first result, on the other hand, gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms, can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ model as a new barrier in the design of differentially private algorithms.

**Course Allocation by Proxy Auction** [Abstract]
Scott Kominers, Mike Ruberry, Jonathan Ullman
WINE 2010 - Workshop on Internet and Network Economics

Abstract: We propose a new proxy bidding mechanism to allocate courses to students given students' reported preferences. Our mechanism is motivated by a specific strategic downgrading manipulation observed in the course allocation mechanism currently used at Harvard Business School (HBS). The proxy bidding mechanism simplifies students' decisions by directly incorporating downgrading into the mechanism.
Our proxy bidding mechanism is Pareto efficient with respect to lexicographic preferences and can be extended to a more expressive preference language that allows complementarity, substitutability, and indifference. Simulations suggest that the proxy bidding mechanism is robust to the manipulations observed in the HBS mechanism and may yield welfare improvements.

**PCPs and the Hardness of Generating Private Synthetic Data** [Abstract] [ECCC]
Salil Vadhan, Jonathan Ullman
TCC 2011 - Theory of Cryptography Conference
Invited to the Journal of Cryptology

Abstract: Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm $\mathcal{A}$ that takes a database $D\in (\{0,1\}^d)^n$ and outputs a "synthetic database" $\hat{D}$ all of whose two-way marginals are approximately equal to those of $D$. (A two-way marginal is the fraction of database rows $x \in \{0,1\}^d$ with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS '07), who gave an algorithm running in time $\mathrm{poly}(n,2^d)$. Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC '09) with encodings based on the PCP Theorem. We also present both negative and positive results for generating "relaxed" synthetic data, where the fraction of rows in $D$ satisfying a predicate $c$ is estimated by applying $c$ to each row of $\hat{D}$ and aggregating the results in some way.

**The Price of Privately Releasing Contingency Tables and the Spectra of Random Matrices with Correlated Rows** [Abstract]
Shiva Kasiviswanathan, Mark Rudelson, Adam Smith, Jonathan Ullman
STOC 2010 - Symposium on Theory of Computing

Abstract: Marginal (contingency) tables are the method of choice for government agencies releasing statistical summaries of categorical data. In this paper, we derive lower bounds on how much distortion (noise) is necessary in these tables to ensure the privacy of sensitive data. We extend a line of recent work on impossibility results for private data analysis to a natural and important class of functionalities. Consider a database consisting of $n$ rows (one per individual), each row comprising $d$ binary attributes. For any subset of $T$ attributes of size $|T| = k$, the marginal table for $T$ has $2^k$ entries; each entry counts how many times in the database a particular setting of these attributes occurs. We provide lower bounds for releasing all $\binom{d}{k}$ $k$-attribute marginal tables under several different notions of privacy. We give efficient polynomial-time attacks which allow an adversary to reconstruct sensitive information given insufficiently perturbed marginal table releases. In particular, for a constant $k$, we obtain a tight bound of $\Omega(\min\{n^{1/2}, d^{(k-1)/2}\})$ on the average distortion per entry for any mechanism that releases all $k$-attribute marginals while providing attribute privacy (a weak notion implied by most privacy definitions). Our reconstruction attacks require a new lower bound on the least singular value of a random matrix with correlated rows. Let $M(k)$ be a matrix with $\binom{d}{k}$ rows formed by taking all possible $k$-way entry-wise products of an underlying set of $d$ random vectors from $\{0,1\}^n$. For constant $k$, we show that the least singular value of $M(k)$ is $\Omega(d^{k/2})$ with high probability (the same asymptotic bound as for independent rows). We obtain stronger lower bounds for marginal tables satisfying differential privacy. We give a lower bound of $\Omega(\min\{n^{1/2}, d^{k/2}\})$, which is tight for $n = \Omega(d^k)$. We extend our analysis to obtain stronger results for mechanisms that add instance-independent noise and weaker results when $k$ is super-constant.
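The reconstruction attacks in the last abstract follow the general linear-reconstruction template of Dinur and Nissim: given noisy answers to many counting (subset-sum) queries, solve a least-squares problem and round. A toy version with random subset queries rather than the marginal-table queries from the paper (all parameters here are illustrative):

```python
import numpy as np

def reconstruct(query_matrix, noisy_answers):
    """Least-squares decode: find x minimizing ||Ax - y||_2, then round
    each coordinate to {0, 1}. Succeeds when the per-answer noise is
    well below sqrt(n)."""
    x_hat, *_ = np.linalg.lstsq(query_matrix, noisy_answers, rcond=None)
    return (x_hat > 0.5).astype(int)

rng = np.random.default_rng(1)
n = 50
secret = rng.integers(0, 2, size=n)                     # one hidden bit per individual
A = rng.integers(0, 2, size=(4 * n, n)).astype(float)   # random subset-sum queries
noise = rng.uniform(-0.5, 0.5, size=4 * n)              # small per-answer perturbation
recovered = reconstruct(A, A @ secret + noise)
# With noise this small, the rounding step recovers most of the secret bits,
# which is why the paper's lower bounds require distortion growing with n and d.
```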