Full text | Click to download. |
Citation | In Proceedings of the 2007 ACM SIGKDD International Workshop
on Privacy, Security, and Trust in KDD
|
Authors | Sachin Lodha
Dilys Thomas |
In this age of globalization, organizations need to publish their
micro-data owing to legal directives or share it with business associates
in order to remain competitive. This puts personal privacy at
risk. To surmount this risk, attributes that clearly identify individuals,
such as Name, Social Security Number, Driving
License Number, are generally removed or replaced by random
values. But this may not be enough because such de-identified
databases can sometimes be joined with other public databases on
attributes such as Gender, Date of Birth, and Zipcode to
re-identify individuals who were supposed to remain anonymous.
In literature, such an identity-leaking attribute combination is called
as a quasi-identifier. It is always critical to be able to recognize
quasi-identifiers and to apply to them appropriate protective measures
to mitigate the identity disclosure risk posed by join attacks.
In this paper, we start out by providing the first formal characterization
and a practical technique to identify quasi-identifiers. We
show an interesting connection between whether a set of columns
forms a quasi-identifier and the number of distinct values assumed
by the combination of the columns. We then use this characterization
to come up with a probabilistic notion of anonymity. Again
we show an interesting connection between the number of distinct
values taken by a combination of columns and the anonymity it
can offer. This allows us to find an ideal amount of generalization
or suppression to apply to different columns in order to achieve
probabilistic anonymity. We work through many examples and
show that our analysis can be used to make a published database
conform to privacy acts like HIPAA. In order to achieve the probabilistic
anonymity, we observe that one needs to solve multiple
1-dimensional k-anonymity problems. We propose many efficient
and scalable algorithms for achieving 1-dimensional anonymity.
Our algorithms are optimal in a sense that they minimally distort
data and retain much of its utility.