File Download
Supplementary
-
Citations:
- Appears in Collections:
postgraduate thesis: Automatic and efficient privacy preserving and fault detection techniques for big-data systems
Title | Automatic and efficient privacy preserving and fault detection techniques for big-data systems |
---|---|
Authors | |
Advisors | |
Issue Date | 2021 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Li, T. O. [李梓安]. (2021). Automatic and efficient privacy preserving and fault detection techniques for big-data systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | This thesis focuses on two important topics concerning big-data systems’ security: enforcing Differential Privacy and conducting fault detection. Although these two topics are independent of each other, the solution proposed by this thesis to tackle the research problem of each topic uses similar methodology (i.e., solving a computationally-expensive problem by sampling).
Specifically, Differential Privacy (DP) is a powerful technique to preserve the privacy of a dataset analyzed by third-parties’ big-data mining queries. Given an input dataset and a query, enforcing DP requires computing a parameter called sensitivity: the greatest change on the query’s output when a data record is added to or removed from the input dataset. Unfortunately, directly computing the sensitivity value for a query and an input dataset is prohibitively inefficient, because it requires adding or removing every data record from the dataset and repeatedly running the same query on the dataset.
This thesis presents UPA, the first automated and efficient sensitivity inferring approach for big-data mining queries. Our key observation is that big-data systems’ (e.g., MapReduce’s) operators often have commutative and associative properties in order to enable parallelism among computers. Therefore, UPA leverages these properties to greatly reduce the repeated computations of inferring sensitivity at runtime. Also, we showed that sensitivity of a query often follows normal distribution, so the sensitivity of a query can be accurately inferred based on a small set of the query’s output changes when a data record is added to or removed from the input dataset. (i.e., UPA infers sensitivity by sampling). This thesis further presents Erebus, an extension of UPA which enhances UPA’s efficiency by orders of magnitude.
The methodology adopted by UPA to solve the challenge of enforcing DP (i.e., solving a computationally-expensive problem by sampling) also inspires us to solve the challenge of conducting Deep Learning System (DLS) testing, a technique to search for faults (i.e., incorrect inference results) of a DLS. Existing DLS testing techniques are unautomated (i.e., require heavy manual effort in configuration), and often detected only a tiny proportion of all possible faults of a DLS. We believe the main reason is that these testing techniques search for a DLS’s faults deterministically, yet a DLS’s faults occur probabilistically. However, searching for a DLS’s probabilistic faults is challenging because the probabilities which these faults would occur are often unknown. Inspecting all possible inputs of a DLS identifies all the faults, but it is prohibitively inefficient.
This thesis presents Themis, an automated and efficient DLS testing technique which is guaranteed to identify all faults of a DLS at high probability. Our observation is that the probabilities which a DLS’s faults would occur follow Bayesian hierarchy (i.e., a small set of random samples is sufficient to infer the probability), so Themis samples a small set of a DLS’s faults in order to infer the probability (i.e., Themis solves a computationally-expensive problem by sampling, similar to UPA). Our evaluation showed Themis identified 3.78X more faults than baselines by using similar time costs as the baselines |
Degree | Master of Philosophy |
Subject | Big data - Security measures |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/311686 |
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Cui, H | - |
dc.contributor.advisor | Wang, CL | - |
dc.contributor.author | Li, Tsz On | - |
dc.contributor.author | 李梓安 | - |
dc.date.accessioned | 2022-03-30T05:42:23Z | - |
dc.date.available | 2022-03-30T05:42:23Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | Li, T. O. [李梓安]. (2021). Automatic and efficient privacy preserving and fault detection techniques for big-data systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/311686 | - |
dc.description.abstract | This thesis focuses on two important topics concerning big-data systems’ security: enforcing Differential Privacy and conducting fault detection. Although these two topics are independent of each other, the solution proposed by this thesis to tackle the research problem of each topic uses similar methodology (i.e., solving a computationally-expensive problem by sampling). Specifically, Differential Privacy (DP) is a powerful technique to preserve the privacy of a dataset analyzed by third-parties’ big-data mining queries. Given an input dataset and a query, enforcing DP requires computing a parameter called sensitivity: the greatest change on the query’s output when a data record is added to or removed from the input dataset. Unfortunately, directly computing the sensitivity value for a query and an input dataset is prohibitively inefficient, because it requires adding or removing every data record from the dataset and repeatedly running the same query on the dataset. This thesis presents UPA, the first automated and efficient sensitivity inferring approach for big-data mining queries. Our key observation is that big-data systems’ (e.g., MapReduce’s) operators often have commutative and associative properties in order to enable parallelism among computers. Therefore, UPA leverages these properties to greatly reduce the repeated computations of inferring sensitivity at runtime. Also, we showed that sensitivity of a query often follows normal distribution, so the sensitivity of a query can be accurately inferred based on a small set of the query’s output changes when a data record is added to or removed from the input dataset. (i.e., UPA infers sensitivity by sampling). This thesis further presents Erebus, an extension of UPA which enhances UPA’s efficiency by orders of magnitude. The methodology adopted by UPA to solve the challenge of enforcing DP (i.e., solving a computationally-expensive problem by sampling) also inspires us to solve the challenge of conducting Deep Learning System (DLS) testing, a technique to search for faults (i.e., incorrect inference results) of a DLS. Existing DLS testing techniques are unautomated (i.e., require heavy manual effort in configuration), and often detected only a tiny proportion of all possible faults of a DLS. We believe the main reason is that these testing techniques search for a DLS’s faults deterministically, yet a DLS’s faults occur probabilistically. However, searching for a DLS’s probabilistic faults is challenging because the probabilities which these faults would occur are often unknown. Inspecting all possible inputs of a DLS identifies all the faults, but it is prohibitively inefficient. This thesis presents Themis, an automated and efficient DLS testing technique which is guaranteed to identify all faults of a DLS at high probability. Our observation is that the probabilities which a DLS’s faults would occur follow Bayesian hierarchy (i.e., a small set of random samples is sufficient to infer the probability), so Themis samples a small set of a DLS’s faults in order to infer the probability (i.e., Themis solves a computationally-expensive problem by sampling, similar to UPA). Our evaluation showed Themis identified 3.78X more faults than baselines by using similar time costs as the baselines | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Big data - Security measures | - |
dc.title | Automatic and efficient privacy preserving and fault detection techniques for big-data systems | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2022 | - |
dc.identifier.mmsid | 991044494004103414 | - |