Citation | PhD Thesis, Stanford University, 2006
Author | Krishnaram Kenthapadi
Over the last twenty years, there has been a tremendous growth in the
amount of private data collected about individuals.
With rapid advances in
database, networking, and computing technologies, such data can be
integrated and analyzed digitally. On the one hand, this has led to the
development of data mining tools that aim to infer useful trends from
this data. On the other hand, easy access to personal data poses a
threat to individual privacy. In this thesis, we provide models and
algorithms for protecting the privacy of individuals in such large data
sets while still allowing users to mine useful trends and statistics.
We focus on the problem of statistical disclosure control:
revealing aggregate statistics about a population while preserving the
privacy of individuals. A statistical database can be viewed as a
table containing personal records, where the rows correspond to
individuals and the columns correspond to different attributes. For
example, a medical database may contain attributes such as name, social
security number, address, age, gender, ethnicity, and medical history
for each patient. We would like medical researchers to have some
form of access to this database so that they can learn trends such as
the correlation between age and heart disease, while maintaining individual
privacy. There are broadly two frameworks for protecting privacy in
statistical databases. In the interactive framework, the user
(researcher) queries the database through a privacy mechanism, which may
deny the query or alter the answer to the query in order to ensure
privacy. In the non-interactive framework, the original database is
first sanitized so as to preserve privacy and then the modified version
is released. We study methods under both these frameworks as each method
is useful in different contexts.
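The two frameworks can be sketched over a toy medical table. This is a minimal illustration only: the names (`interactive_query`, `sanitize`), the denial threshold, and the age bucketing are hypothetical choices, not the mechanisms studied in the thesis.

```python
# Toy statistical database: rows are individuals, keys are attributes.
records = [
    {"age": 34, "gender": "F", "heart_disease": False},
    {"age": 61, "gender": "M", "heart_disease": True},
    {"age": 58, "gender": "F", "heart_disease": True},
]

def interactive_query(records, predicate, min_support=2):
    """Interactive framework: answer an aggregate count through a
    privacy mechanism, which may deny it -- here, a simple policy
    that denies when too few rows match (illustrative only)."""
    matches = sum(1 for r in records if predicate(r))
    return matches if matches >= min_support else None  # None = denial

def sanitize(records):
    """Non-interactive framework: release a sanitized version of the
    table instead of the original -- here, ages coarsened to decades."""
    return [dict(r, age=10 * (r["age"] // 10)) for r in records]

print(interactive_query(records, lambda r: r["age"] > 50))  # 2
print(interactive_query(records, lambda r: r["age"] > 60))  # None (denied)
print(sanitize(records)[0]["age"])  # 30
```

The design difference is visible even at this scale: the interactive mechanism decides per query, while the non-interactive release is committed to once, before any queries arrive.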
The first part of the thesis focuses on the interactive framework and
provides models and algorithms for two methods used in this
framework. We first consider the online query auditing problem: given a
sequence of queries that have already been posed about the data, their
corresponding answers, and a new query, deny the new query if answering
it would breach privacy, and give the true answer otherwise. We uncover the
fundamental problem that query denials leak information. As this problem
was overlooked in previous work, some of the previously suggested
auditors can be used by an attacker to compromise the privacy of a large
fraction of the individuals in the data.
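The leak can be illustrated with an assumed toy setting of max queries over real values (the auditor below is a hypothetical "value-based" one, not an algorithm from the thesis): an auditor that denies exactly when the true answer would pin down a value lets an attacker infer that value from the denial itself.

```python
# Toy database of secret real values.
data = {"x1": 3.0, "x2": 7.0, "x3": 9.0}

answered = []  # (query_set, released_answer) pairs already given out

def naive_max_auditor(qset):
    """Deny iff releasing max over qset, combined with a past superset
    query, would uniquely determine the values outside qset (checked
    only for that simple pattern; illustrative, not general)."""
    ans = max(data[i] for i in qset)
    for past_set, past_ans in answered:
        if qset < past_set and ans < past_ans:
            # Answering would reveal that past_set - qset attains
            # past_ans, so this naive auditor denies.
            return None
    answered.append((qset, ans))
    return ans

r1 = naive_max_auditor({"x1", "x2", "x3"})  # 9.0 (answered)
r2 = naive_max_auditor({"x1", "x2"})        # None (denied)
# The attacker reasons: this auditor denies exactly when
# max(x1, x2) < 9.0, which forces x3 = 9.0 -- the denial leaked x3.
print(r1, r2)
```

A simulatable auditor avoids this by deciding denials using only the queries and past answers, never the unreleased data, so a denial carries no extra information.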
To overcome this problem, we introduce a new model called simulatable
auditing, in which query denials provably do not leak information. We also
describe a probabilistic notion of (partial) compromise, in order to
overcome the known limitations of the existing privacy definition. We
then present simulatable auditing algorithms under both these
definitions. The second problem we consider is output perturbation, in
which the database administrator computes the exact answer to the query and then
outputs a perturbed answer (obtained by adding random noise) as the response to
the query. Inspired by the desire to enable individuals to retain
control over their information, we provide a fault-tolerant distributed
implementation of output perturbation schemes, thereby eliminating the
need for a trusted database administrator. In the process, we provide
protocols for the cooperative generation of shares of random noise
according to different distributions.
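One hedged sketch of the idea of noise shares, assuming Gaussian noise for simplicity (the helper name and parameters are invented here; the thesis's actual protocols are cryptographic and cover other distributions as well): each party perturbs its own value with a share of the noise, and the shares sum to noise of the desired magnitude, so no single trusted administrator ever holds the raw data.

```python
import random

def noisy_sum_distributed(values, sigma=2.0, seed=None):
    """Each of the n parties adds an independent N(0, sigma^2 / n)
    noise share to its own value; the sum of the shares is then
    distributed as N(0, sigma^2), matching centralized output
    perturbation of the total without a trusted aggregator."""
    rng = random.Random(seed)  # stands in for each party's local randomness
    n = len(values)
    shares = [v + rng.gauss(0.0, sigma / n ** 0.5) for v in values]
    return sum(shares)

# True sum is 3.0; the released answer is 3.0 plus N(0, 1) noise.
print(noisy_sum_distributed([1.0, 0.0, 1.0, 1.0], sigma=1.0, seed=7))
```

The fault-tolerance aspect (surviving parties that drop out or misbehave) needs the full cryptographic protocols and is not captured by this sketch.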
The second part of the thesis focuses on the non-interactive framework
and considers two anonymization methods for publishing data from a
table of personal records for analysis. We consider the k-Anonymity
model proposed by Samarati and Sweeney, and present approximation
algorithms for anonymizing databases. Then we propose a new method for
anonymizing data records, where the data records are clustered and then
cluster centers are published. To ensure privacy of the data records, we
impose the constraint that each cluster must contain no fewer than a
pre-specified number of data records. We provide approximation
algorithms for computing such a clustering.
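The two definitions can be made concrete with a small hypothetical sketch: a k-anonymity check in the sense of Samarati and Sweeney, and a cluster-center publication step that enforces the minimum-size constraint. The approximation algorithms for producing such tables and clusterings are the thesis's contribution and are not shown here.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """A table is k-anonymous when every combination of
    quasi-identifier values that occurs, occurs at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

def publish_centers(clusters, r):
    """Release only the centroid of each cluster, and only when every
    cluster holds at least r records (the privacy constraint)."""
    assert all(len(c) >= r for c in clusters), "cluster below minimum size"
    return [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]

rows = [{"age": 30, "zip": "94305"}, {"age": 30, "zip": "94305"},
        {"age": 40, "zip": "94040"}, {"age": 40, "zip": "94040"}]
print(is_k_anonymous(rows, ["age", "zip"], k=2))        # True
print(publish_centers([[(1.0, 2.0), (3.0, 4.0)]], r=2)) # [(2.0, 3.0)]
```

Note the different release shapes: k-anonymity publishes (generalized) rows, while cluster-center publication releases only aggregate points, with the size bound playing the role that k plays for anonymity.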