File Download
Supplementary

postgraduate thesis: Automatic and efficient privacy preserving and fault detection techniques for big-data systems

TitleAutomatic and efficient privacy preserving and fault detection techniques for big-data systems
Authors
Advisors
Advisor(s):Cui, HWang, CL
Issue Date2021
PublisherThe University of Hong Kong (Pokfulam, Hong Kong)
Citation
Li, T. O. [李梓安]. (2021). Automatic and efficient privacy preserving and fault detection techniques for big-data systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
AbstractThis thesis focuses on two important topics concerning big-data systems’ security: enforcing Differential Privacy and conducting fault detection. Although these two topics are independent of each other, the solution proposed by this thesis to tackle the research problem of each topic uses similar methodology (i.e., solving a computationally-expensive problem by sampling). Specifically, Differential Privacy (DP) is a powerful technique to preserve the privacy of a dataset analyzed by third-parties’ big-data mining queries. Given an input dataset and a query, enforcing DP requires computing a parameter called sensitivity: the greatest change on the query’s output when a data record is added to or removed from the input dataset. Unfortunately, directly computing the sensitivity value for a query and an input dataset is prohibitively inefficient, because it requires adding or removing every data record from the dataset and repeatedly running the same query on the dataset. This thesis presents UPA, the first automated and efficient sensitivity inferring approach for big-data mining queries. Our key observation is that big-data systems’ (e.g., MapReduce’s) operators often have commutative and associative properties in order to enable parallelism among computers. Therefore, UPA leverages these properties to greatly reduce the repeated computations of inferring sensitivity at runtime. Also, we showed that sensitivity of a query often follows normal distribution, so the sensitivity of a query can be accurately inferred based on a small set of the query’s output changes when a data record is added to or removed from the input dataset. (i.e., UPA infers sensitivity by sampling). This thesis further presents Erebus, an extension of UPA which enhances UPA’s efficiency by orders of magnitude. The methodology adopted by UPA to solve the challenge of enforcing DP (i.e., solving a computationally-expensive problem by sampling) also inspires us to solve the challenge of conducting Deep Learning System (DLS) testing, a technique to search for faults (i.e., incorrect inference results) of a DLS. Existing DLS testing techniques are unautomated (i.e., require heavy manual effort in configuration), and often detected only a tiny proportion of all possible faults of a DLS. We believe the main reason is that these testing techniques search for a DLS’s faults deterministically, yet a DLS’s faults occur probabilistically. However, searching for a DLS’s probabilistic faults is challenging because the probabilities which these faults would occur are often unknown. Inspecting all possible inputs of a DLS identifies all the faults, but it is prohibitively inefficient. This thesis presents Themis, an automated and efficient DLS testing technique which is guaranteed to identify all faults of a DLS at high probability. Our observation is that the probabilities which a DLS’s faults would occur follow Bayesian hierarchy (i.e., a small set of random samples is sufficient to infer the probability), so Themis samples a small set of a DLS’s faults in order to infer the probability (i.e., Themis solves a computationally-expensive problem by sampling, similar to UPA). Our evaluation showed Themis identified 3.78X more faults than baselines by using similar time costs as the baselines
DegreeMaster of Philosophy
SubjectBig data - Security measures
Dept/ProgramComputer Science
Persistent Identifierhttp://hdl.handle.net/10722/311686

 

DC FieldValueLanguage
dc.contributor.advisorCui, H-
dc.contributor.advisorWang, CL-
dc.contributor.authorLi, Tsz On-
dc.contributor.author李梓安-
dc.date.accessioned2022-03-30T05:42:23Z-
dc.date.available2022-03-30T05:42:23Z-
dc.date.issued2021-
dc.identifier.citationLi, T. O. [李梓安]. (2021). Automatic and efficient privacy preserving and fault detection techniques for big-data systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.-
dc.identifier.urihttp://hdl.handle.net/10722/311686-
dc.description.abstractThis thesis focuses on two important topics concerning big-data systems’ security: enforcing Differential Privacy and conducting fault detection. Although these two topics are independent of each other, the solution proposed by this thesis to tackle the research problem of each topic uses similar methodology (i.e., solving a computationally-expensive problem by sampling). Specifically, Differential Privacy (DP) is a powerful technique to preserve the privacy of a dataset analyzed by third-parties’ big-data mining queries. Given an input dataset and a query, enforcing DP requires computing a parameter called sensitivity: the greatest change on the query’s output when a data record is added to or removed from the input dataset. Unfortunately, directly computing the sensitivity value for a query and an input dataset is prohibitively inefficient, because it requires adding or removing every data record from the dataset and repeatedly running the same query on the dataset. This thesis presents UPA, the first automated and efficient sensitivity inferring approach for big-data mining queries. Our key observation is that big-data systems’ (e.g., MapReduce’s) operators often have commutative and associative properties in order to enable parallelism among computers. Therefore, UPA leverages these properties to greatly reduce the repeated computations of inferring sensitivity at runtime. Also, we showed that sensitivity of a query often follows normal distribution, so the sensitivity of a query can be accurately inferred based on a small set of the query’s output changes when a data record is added to or removed from the input dataset. (i.e., UPA infers sensitivity by sampling). This thesis further presents Erebus, an extension of UPA which enhances UPA’s efficiency by orders of magnitude. The methodology adopted by UPA to solve the challenge of enforcing DP (i.e., solving a computationally-expensive problem by sampling) also inspires us to solve the challenge of conducting Deep Learning System (DLS) testing, a technique to search for faults (i.e., incorrect inference results) of a DLS. Existing DLS testing techniques are unautomated (i.e., require heavy manual effort in configuration), and often detected only a tiny proportion of all possible faults of a DLS. We believe the main reason is that these testing techniques search for a DLS’s faults deterministically, yet a DLS’s faults occur probabilistically. However, searching for a DLS’s probabilistic faults is challenging because the probabilities which these faults would occur are often unknown. Inspecting all possible inputs of a DLS identifies all the faults, but it is prohibitively inefficient. This thesis presents Themis, an automated and efficient DLS testing technique which is guaranteed to identify all faults of a DLS at high probability. Our observation is that the probabilities which a DLS’s faults would occur follow Bayesian hierarchy (i.e., a small set of random samples is sufficient to infer the probability), so Themis samples a small set of a DLS’s faults in order to infer the probability (i.e., Themis solves a computationally-expensive problem by sampling, similar to UPA). Our evaluation showed Themis identified 3.78X more faults than baselines by using similar time costs as the baselines-
dc.languageeng-
dc.publisherThe University of Hong Kong (Pokfulam, Hong Kong)-
dc.relation.ispartofHKU Theses Online (HKUTO)-
dc.rightsThe author retains all proprietary rights, (such as patent rights) and the right to use in future works.-
dc.rightsThis work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.-
dc.subject.lcshBig data - Security measures-
dc.titleAutomatic and efficient privacy preserving and fault detection techniques for big-data systems-
dc.typePG_Thesis-
dc.description.thesisnameMaster of Philosophy-
dc.description.thesislevelMaster-
dc.description.thesisdisciplineComputer Science-
dc.description.naturepublished_or_final_version-
dc.date.hkucongregation2022-
dc.identifier.mmsid991044494004103414-

Export via OAI-PMH Interface in XML Formats


OR


Export to Other Non-XML Formats