Modeling Re-identification Probability in Differentially Private Data Release for Data Analytics: a Case of Kenya
Abstract
With the proliferation of smart devices and the Internet of Things, large volumes of data are being generated quickly and in various formats every day, leading to what is commonly referred to as a data deluge. This phenomenon has come to be known as Big Data.
Data in its raw form is not very useful, but it is the source from which information, knowledge, and even wisdom are obtained. Storing large datasets without processing them to add value to the data keepers may be equated to building data tombs, which are of little use. There is a need to extract the value from the voluminous data available through data analytics. The process of making connections, identifying patterns, predicting behavior, and personalizing interactions during data analytics is useful in extracting implicit, previously unknown, and valuable information. However, the process poses a threat of breaching personal information privacy. Similarly, secondary data analysis has gained popularity, increasing the demand for data release and sharing.
To leverage the potential of data analytics and secondary analyses, data release must both protect the privacy of the data subjects and retain the analytical utility of the released data. Some models used to preserve data privacy reduce analytical utility, rendering the resulting datasets unsuitable for analytics and secondary analyses. A good trade-off between data privacy and analytical utility is therefore needed in data release.
The model that promises to balance data privacy and analytical utility, the two antagonistic goals of private data release, is ε-differential privacy. However, the model has seen few practical applications because its theoretical mathematical expression is not in a utilitarian format. The reviewed literature demonstrated that the model has not been widely implemented owing to the difficulty of establishing its privacy parameter, the epsilon (ε) value.
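For context, the ε-differential privacy guarantee referred to above is conventionally stated as follows (a standard formulation from the differential privacy literature, not reproduced from this study): a randomized mechanism $\mathcal{M}$ satisfies ε-differential privacy if, for all pairs of neighboring datasets $D$ and $D'$ differing in a single record, and for every set of outputs $S$,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

Smaller values of ε bound the two probabilities more tightly, giving any individual record less influence on the released output, which is why choosing ε is the central practical difficulty the abstract describes.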
The reviewed literature suggested a method of estimating this privacy parameter. However, one variable needed for the estimate, the probability (risk) of being re-identified, was not provided in a heuristic manner. The theoretical approach to computing this risk tends to overestimate it, which in turn drives high levels of anonymization. Excessive anonymization reduces the analytical utility of datasets. The study proposed a causal relationship model that provides a realistic estimate of the probability of being re-identified. A realistic re-identification risk would make the choice of a privacy parameter that
balances data privacy and analytical utility practicable. The proposed model of re-identification probability was validated empirically using quantitative data from a quasi-experimental design with real-world datasets. The validation confirmed the hypothesized causal relationships and established a realistic regional re-identification risk of 5.3%, within the acceptable range of less than 9%.
The study has modeled re-identification probability by introducing predictor constructs that realistically estimate the risk of re-identification. The independent constructs are (1) the analytical competence of the adversary; (2) the distinguishing power of the attributes of the anonymized datasets that are released; and (3) the linkage mapping of the auxiliary identified datasets. The constructs’ observable indicators have also been outlined and explained. The hypotheses for each of the three constructs to be positively influencing re-identification risk were proven to be true. The study also provided the predictive strength of each of the predictor constructs as well as their collective predictive impact. Further, the study established a mediation effect among the constructs, resulting in an improved re-identification risk model. With a model for realistic re-identification risk, the appropriate privacy parameter (ε) can now be computed and provide the much-needed data privacy and utility trade-off. The trade-off will enable private data release, which supports data analytics.
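To make the privacy/utility trade-off governed by ε concrete, the sketch below shows the standard Laplace mechanism, in which released statistics are perturbed with noise of scale sensitivity/ε. This is a minimal illustration of how the parameter behaves once chosen; it is not the study's re-identification model, and the function names are illustrative only.

```python
import math
import random


def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Standard Laplace mechanism: noise scale = sensitivity / epsilon.

    A smaller epsilon (stronger privacy, lower tolerated re-identification
    risk) means larger noise and hence lower analytical utility.
    """
    return true_count + laplace_noise(sensitivity / epsilon)


# Illustration: with a larger epsilon the released count tends to stay
# closer to the true value of 1000 than with a small epsilon.
random.seed(0)
print(private_count(1000, epsilon=0.1))
print(private_count(1000, epsilon=2.0))
```

A realistic re-identification risk estimate, as modeled in this study, is what allows an ε to be selected that is neither needlessly small (destroying utility) nor too large (exposing data subjects).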
Publisher
University of Nairobi
Subject
Modeling Re-identification
Rights
Attribution-NonCommercial-NoDerivs 3.0 United States
Usage Rights
http://creativecommons.org/licenses/by-nc-nd/3.0/us/