
postgraduate thesis: CODED : SC-oriented data error detection

Title: CODED : SC-oriented data error detection
Authors: Yan, Jing [嚴晶]
Advisor(s): Cheng, CK
Issue Date: 2018
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yan, J. [嚴晶]. (2018). CODED : SC-oriented data error detection. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user’s domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable, and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge, beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated or not on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs have conflicts or not. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches.
Degree: Master of Philosophy
Subject: Error analysis (Mathematics); Error-correcting codes (Information theory)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/268432
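
The abstract describes checking whether an independence SC between two columns is violated, and ranking the records that contribute most to a violation. The following is a minimal sketch of that idea, not the thesis's actual algorithms, which this record does not describe. It assumes two categorical columns, uses a Pearson chi-squared statistic as the independence measure, and ranks records with a simple leave-one-out heuristic. The function names `chi2_statistic`, `independence_sc_violated`, and `top_k_contributors` are hypothetical, and the critical value is supplied by the caller.

```python
from collections import Counter


def chi2_statistic(xs, ys):
    """Pearson chi-squared statistic and degrees of freedom for two
    categorical columns given as equal-length lists."""
    n = len(xs)
    if n == 0:
        return 0.0, 0
    joint = Counter(zip(xs, ys))
    cx, cy = Counter(xs), Counter(ys)
    stat = 0.0
    for a, na in cx.items():
        for b, nb in cy.items():
            expected = na * nb / n  # count expected under independence
            stat += (joint.get((a, b), 0) - expected) ** 2 / expected
    dof = (len(cx) - 1) * (len(cy) - 1)
    return stat, dof


def independence_sc_violated(xs, ys, critical_value):
    """Assumption: an independence SC counts as violated when the
    statistic exceeds a caller-chosen critical value."""
    stat, _ = chi2_statistic(xs, ys)
    return stat > critical_value


def top_k_contributors(xs, ys, k):
    """Illustrative heuristic: rank records by how much the statistic
    drops when that single record is left out."""
    full, _ = chi2_statistic(xs, ys)
    drops = []
    for i in range(len(xs)):
        rest, _ = chi2_statistic(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        drops.append((full - rest, i))
    drops.sort(key=lambda t: -t[0])  # largest drop first; ties keep index order
    return [i for _, i in drops[:k]]
```

For a 2x2 table (one degree of freedom), 3.841 is the usual critical value at the 0.05 significance level, so `independence_sc_violated(xs, ys, 3.841)` would flag two perfectly correlated binary columns. The leave-one-out ranking recomputes the statistic once per record, so it is only practical for small datasets.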

 

DC Field | Value | Language
dc.contributor.advisor | Cheng, CK | -
dc.contributor.author | Yan, Jing | -
dc.contributor.author | 嚴晶 | -
dc.date.accessioned | 2019-03-21T01:40:23Z | -
dc.date.available | 2019-03-21T01:40:23Z | -
dc.date.issued | 2018 | -
dc.identifier.citation | Yan, J. [嚴晶]. (2018). CODED : SC-oriented data error detection. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/268432 | -
dc.description.abstract | A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user’s domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable, and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge, beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated or not on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs have conflicts or not. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches. | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Error analysis (Mathematics) | -
dc.subject.lcsh | Error-correcting codes (Information theory) | -
dc.title | CODED : SC-oriented data error detection | -
dc.type | PG_Thesis | -
dc.description.thesisname | Master of Philosophy | -
dc.description.thesislevel | Master | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2019 | -
dc.identifier.mmsid | 991044091308203414 | -
