postgraduate thesis: CODED : SC-oriented data error detection
Title | CODED : SC-oriented data error detection |
---|---|
Authors | Yan, Jing (嚴晶) |
Advisors | Cheng, CK |
Issue Date | 2018 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yan, J. [嚴晶]. (2018). CODED : SC-oriented data error detection. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user’s domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs conflict with one another. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches. |
Degree | Master of Philosophy |
Subject | Error analysis (Mathematics); Error-correcting codes (Information theory) |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/268432 |
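
The first task named in the abstract, checking whether an independence SC between two columns is violated, can be illustrated with a standard Pearson chi-square test of independence. This is only a minimal sketch under stated assumptions: the record does not say which test statistic CODED uses, so the chi-square test is swapped in here for illustration, and the names `chi_square_statistic`, `independence_sc_violated`, and `CHI2_CRIT_DOF1_05` are hypothetical. Both columns are assumed to be categorical.

```python
from collections import Counter

# Critical value of the chi-square distribution for 1 degree of freedom
# at the 0.05 significance level (applies to 2x2 contingency tables only).
CHI2_CRIT_DOF1_05 = 3.841


def chi_square_statistic(xs, ys):
    """Return (statistic, degrees_of_freedom) for two categorical columns.

    Illustrative only: a textbook Pearson chi-square test, not the
    method described in the thesis.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("columns must not be empty")

    joint = Counter(zip(xs, ys))   # observed count of each (x, y) pair
    x_counts = Counter(xs)         # marginal counts of column X
    y_counts = Counter(ys)         # marginal counts of column Y

    stat = 0.0
    for x, cx in x_counts.items():
        for y, cy in y_counts.items():
            expected = cx * cy / n             # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected

    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return stat, dof


def independence_sc_violated(xs, ys, critical_value):
    """Treat the SC 'X is independent of Y' as violated when the
    statistic exceeds the caller-supplied critical value."""
    stat, _ = chi_square_statistic(xs, ys)
    return stat > critical_value


# Hypothetical toy data: Y copies X in the first pair, and is balanced
# across X in the second pair.
dependent_x = ["a", "a", "b", "b"] * 10
dependent_y = ["a", "a", "b", "b"] * 10
independent_x = ["a", "a", "b", "b"] * 10
independent_y = ["u", "v", "u", "v"] * 10
```

With the toy data above, `independence_sc_violated(dependent_x, dependent_y, CHI2_CRIT_DOF1_05)` flags a violation, while the balanced pair does not. The caller must supply a critical value that matches the returned degrees of freedom; the constant shown is valid only when both columns have exactly two distinct values.
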
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Cheng, CK | - |
dc.contributor.author | Yan, Jing | - |
dc.contributor.author | 嚴晶 | - |
dc.date.accessioned | 2019-03-21T01:40:23Z | - |
dc.date.available | 2019-03-21T01:40:23Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Yan, J. [嚴晶]. (2018). CODED : SC-oriented data error detection. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/268432 | - |
dc.description.abstract | A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user’s domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs conflict with one another. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Error analysis (Mathematics) | - |
dc.subject.lcsh | Error-correcting codes (Information theory) | - |
dc.title | CODED : SC-oriented data error detection | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991044091308203414 | - |
dc.date.hkucongregation | 2019 | - |
dc.identifier.mmsid | 991044091308203414 | - |