Automatic and efficient privacy preserving and fault detection techniques for big-data systems

Li, Tsz On; 李梓安

File Download

FullText.pdf

Supplementary

Citations:
Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Automatic and efficient privacy preserving and fault detection techniques for big-data systems

Title	Automatic and efficient privacy preserving and fault detection techniques for big-data systems
Authors	Li, Tsz On 李梓安
Advisors	Advisor(s):Cui, H Wang, CL
Issue Date	2021
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Li, T. O. [李梓安]. (2021). Automatic and efficient privacy preserving and fault detection techniques for big-data systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	This thesis focuses on two important topics concerning big-data systems’ security: enforcing Differential Privacy and conducting fault detection. Although these two topics are independent of each other, the solution proposed by this thesis to tackle the research problem of each topic uses similar methodology (i.e., solving a computationally-expensive problem by sampling). Specifically, Differential Privacy (DP) is a powerful technique to preserve the privacy of a dataset analyzed by third-parties’ big-data mining queries. Given an input dataset and a query, enforcing DP requires computing a parameter called sensitivity: the greatest change on the query’s output when a data record is added to or removed from the input dataset. Unfortunately, directly computing the sensitivity value for a query and an input dataset is prohibitively inefficient, because it requires adding or removing every data record from the dataset and repeatedly running the same query on the dataset. This thesis presents UPA, the first automated and efficient sensitivity inferring approach for big-data mining queries. Our key observation is that big-data systems’ (e.g., MapReduce’s) operators often have commutative and associative properties in order to enable parallelism among computers. Therefore, UPA leverages these properties to greatly reduce the repeated computations of inferring sensitivity at runtime. Also, we showed that sensitivity of a query often follows normal distribution, so the sensitivity of a query can be accurately inferred based on a small set of the query’s output changes when a data record is added to or removed from the input dataset. (i.e., UPA infers sensitivity by sampling). This thesis further presents Erebus, an extension of UPA which enhances UPA’s efficiency by orders of magnitude. The methodology adopted by UPA to solve the challenge of enforcing DP (i.e., solving a computationally-expensive problem by sampling) also inspires us to solve the challenge of conducting Deep Learning System (DLS) testing, a technique to search for faults (i.e., incorrect inference results) of a DLS. Existing DLS testing techniques are unautomated (i.e., require heavy manual effort in configuration), and often detected only a tiny proportion of all possible faults of a DLS. We believe the main reason is that these testing techniques search for a DLS’s faults deterministically, yet a DLS’s faults occur probabilistically. However, searching for a DLS’s probabilistic faults is challenging because the probabilities which these faults would occur are often unknown. Inspecting all possible inputs of a DLS identifies all the faults, but it is prohibitively inefficient. This thesis presents Themis, an automated and efficient DLS testing technique which is guaranteed to identify all faults of a DLS at high probability. Our observation is that the probabilities which a DLS’s faults would occur follow Bayesian hierarchy (i.e., a small set of random samples is sufficient to infer the probability), so Themis samples a small set of a DLS’s faults in order to infer the probability (i.e., Themis solves a computationally-expensive problem by sampling, similar to UPA). Our evaluation showed Themis identified 3.78X more faults than baselines by using similar time costs as the baselines
Degree	Master of Philosophy
Subject	Big data - Security measures
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/311686

DC Field	Value	Language
dc.contributor.advisor	Cui, H	-
dc.contributor.advisor	Wang, CL	-
dc.contributor.author	Li, Tsz On	-
dc.contributor.author	李梓安	-
dc.date.accessioned	2022-03-30T05:42:23Z	-
dc.date.available	2022-03-30T05:42:23Z	-
dc.date.issued	2021	-
dc.identifier.citation	Li, T. O. [李梓安]. (2021). Automatic and efficient privacy preserving and fault detection techniques for big-data systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/311686	-
dc.description.abstract	This thesis focuses on two important topics concerning big-data systems’ security: enforcing Differential Privacy and conducting fault detection. Although these two topics are independent of each other, the solution proposed by this thesis to tackle the research problem of each topic uses similar methodology (i.e., solving a computationally-expensive problem by sampling). Specifically, Differential Privacy (DP) is a powerful technique to preserve the privacy of a dataset analyzed by third-parties’ big-data mining queries. Given an input dataset and a query, enforcing DP requires computing a parameter called sensitivity: the greatest change on the query’s output when a data record is added to or removed from the input dataset. Unfortunately, directly computing the sensitivity value for a query and an input dataset is prohibitively inefficient, because it requires adding or removing every data record from the dataset and repeatedly running the same query on the dataset. This thesis presents UPA, the first automated and efficient sensitivity inferring approach for big-data mining queries. Our key observation is that big-data systems’ (e.g., MapReduce’s) operators often have commutative and associative properties in order to enable parallelism among computers. Therefore, UPA leverages these properties to greatly reduce the repeated computations of inferring sensitivity at runtime. Also, we showed that sensitivity of a query often follows normal distribution, so the sensitivity of a query can be accurately inferred based on a small set of the query’s output changes when a data record is added to or removed from the input dataset. (i.e., UPA infers sensitivity by sampling). This thesis further presents Erebus, an extension of UPA which enhances UPA’s efficiency by orders of magnitude. The methodology adopted by UPA to solve the challenge of enforcing DP (i.e., solving a computationally-expensive problem by sampling) also inspires us to solve the challenge of conducting Deep Learning System (DLS) testing, a technique to search for faults (i.e., incorrect inference results) of a DLS. Existing DLS testing techniques are unautomated (i.e., require heavy manual effort in configuration), and often detected only a tiny proportion of all possible faults of a DLS. We believe the main reason is that these testing techniques search for a DLS’s faults deterministically, yet a DLS’s faults occur probabilistically. However, searching for a DLS’s probabilistic faults is challenging because the probabilities which these faults would occur are often unknown. Inspecting all possible inputs of a DLS identifies all the faults, but it is prohibitively inefficient. This thesis presents Themis, an automated and efficient DLS testing technique which is guaranteed to identify all faults of a DLS at high probability. Our observation is that the probabilities which a DLS’s faults would occur follow Bayesian hierarchy (i.e., a small set of random samples is sufficient to infer the probability), so Themis samples a small set of a DLS’s faults in order to infer the probability (i.e., Themis solves a computationally-expensive problem by sampling, similar to UPA). Our evaluation showed Themis identified 3.78X more faults than baselines by using similar time costs as the baselines	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Big data - Security measures	-
dc.title	Automatic and efficient privacy preserving and fault detection techniques for big-data systems	-
dc.type	PG_Thesis	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2022	-
dc.identifier.mmsid	991044494004103414	-

File Download

Supplementary

postgraduate thesis: Automatic and efficient privacy preserving and fault detection techniques for big-data systems

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats