Self-Identifying Data for Fair Use

Stephen Chong, Christian Skalka, and Jeffrey A. Vaughan

Journal of Data and Information Quality 5(3), December 2014.

Abstract.

Public-use earth science datasets are a useful resource with the unfortunate feature that their provenance is easily disconnected from their content. “Fair-use policies” typically associated with these datasets require appropriate attribution of providers by users, but sound and complete attribution is difficult if provenance information is lost. To address this we introduce a technique to directly associate provenance information with sensor datasets. Our technique is similar to traditional watermarking but is intended for application to unstructured time-series datasets. Our approach is potentially imperceptible given sufficient margins of error in datasets, and is robust to a number of benign but likely transformations including truncation, rounding, bit-flipping, sampling, and reordering. We provide algorithms for both one-bit and blind mark checking, and show how our system can be adapted to various data representation types. Our algorithms are probabilistic in nature and are characterized by both combinatorial and empirical analyses. Mark embedding can be applied at any point in the data lifecycle, allowing adaptation of our scheme to social or scientific concerns.