Citation: In Proceedings of the 2007 ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD.
In this age of globalization, organizations need to publish their micro-data owing to legal directives, or share it with business associates in order to remain competitive. This puts personal privacy at risk. To counter this risk, attributes that clearly identify individuals, such as Name, Social Security Number, and Driving License Number, are generally removed or replaced by random values. But this may not be enough, because such de-identified databases can sometimes be joined with other public databases on attributes such as Gender, Date of Birth, and Zipcode to re-identify individuals who were supposed to remain anonymous. In the literature, such an identity-leaking attribute combination is called a quasi-identifier. It is therefore critical to be able to recognize quasi-identifiers and to apply appropriate protective measures to them, so as to mitigate the identity disclosure risk posed by join attacks.
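The join attack described above can be sketched on toy data. This is purely illustrative: the records, column names, and the `join_attack` helper are invented for the example, not taken from the paper.

```python
# Toy "join attack": a de-identified table is linked to a public table
# on a quasi-identifier (Gender, Date of Birth, Zipcode).

# Published table: direct identifiers removed, quasi-identifier kept.
deidentified = [
    {"gender": "F", "dob": "1961-07-21", "zip": "02138", "diagnosis": "flu"},
    {"gender": "M", "dob": "1974-03-02", "zip": "94305", "diagnosis": "asthma"},
]

# Public record (e.g. a voter roll) with the same attributes plus names.
voter_roll = [
    {"name": "Alice Smith", "gender": "F", "dob": "1961-07-21", "zip": "02138"},
    {"name": "Bob Jones", "gender": "M", "dob": "1990-11-30", "zip": "10001"},
]

QUASI_ID = ("gender", "dob", "zip")

def join_attack(published, public):
    """Link records that agree on every quasi-identifier attribute."""
    matches = []
    for rec in published:
        key = tuple(rec[c] for c in QUASI_ID)
        for person in public:
            if tuple(person[c] for c in QUASI_ID) == key:
                matches.append((person["name"], rec["diagnosis"]))
    return matches

print(join_attack(deidentified, voter_roll))
# A unique match on the quasi-identifier re-identifies Alice's diagnosis.
```

Because the quasi-identifier combination takes a distinct value for Alice's record, the join yields a unique match even though her name never appeared in the published table.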
In this paper, we start out by providing the first formal characterization of quasi-identifiers, along with a practical technique to identify them. We show an interesting connection between whether a set of columns forms a quasi-identifier and the number of distinct values assumed by the combination of those columns.

We then use this characterization to derive a probabilistic notion of anonymity. Again, we show an interesting connection between the number of distinct values taken by a combination of columns and the anonymity it can offer. This allows us to find the right amount of generalization or suppression to apply to different columns in order to achieve probabilistic anonymity. We work through many examples and show that our analysis can be used to make a published database conform to privacy regulations such as HIPAA.

To achieve probabilistic anonymity, we observe that one needs to solve multiple 1-dimensional k-anonymity problems. We propose several efficient and scalable algorithms for achieving 1-dimensional anonymity. Our algorithms are optimal in the sense that they minimally distort the data and retain much of its utility.