
postgraduate thesis: Budget-limited data disambiguation

Title: Budget-limited data disambiguation
Authors: Yang, Xuan (楊譞)
Advisors: Cheung, DWL; Cheng, CK
Issue Date: 2013
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yang, X. [楊譞]. (2013). Budget-limited data disambiguation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5177333
Abstract: The problem of data ambiguity exists in a wide range of applications. In this thesis, we study "cost-aware" methods to alleviate data ambiguity problems in uncertain databases and social tagging data.
In database applications, ambiguous (or uncertain) data may originate from data integration and from the measurement error of devices. These ambiguous data are maintained by uncertain databases. In many situations, it is possible to "clean", or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what the user is doing) can be used to reduce the imprecision of the location value. In practice, a cleaning activity often involves a cost, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely it is that database entities can be cleaned may not be precisely known. We model these aspects with the uncertain database cleaning problem, which requires us to make sensible decisions in selecting entities to clean in order to maximize the amount of ambiguous information removed under a limited budget. To solve this problem, we propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness.
Social tagging data capture web users' textual annotations, called tags, for resources (e.g., webpages and photos). Since tags are given by casual users, they often contain noise (e.g., misspelled words) and may not cover all the aspects of each resource. In this thesis, we design a metric to systematically measure the tagging quality of each resource based on the tags it has received. We propose an incentive-based tagging framework to improve the tagging quality. The main idea is to award users some incentive for giving (relevant) tags to resources. The challenge is how to allocate incentives to a large set of resources so as to maximize the improvement of their tagging quality under a limited budget. To solve this problem, we propose a few efficient incentive allocation strategies. Experiments show that our best strategy provides resources with a close-to-optimal gain in tagging quality.
To summarize, we study the problem of budget-limited data disambiguation for uncertain databases and social tagging data: given a set of objects (entities from uncertain databases or web resources), how can we make sensible decisions about which object to "disambiguate" (to perform a cleaning activity on the entity or ask a user to tag the resource), in order to maximize the amount of ambiguous information reduced under a limited budget?
Degree: Doctor of Philosophy
Subject: Data mining - Mathematical models
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/196458
HKU Library Item ID: b5177333

 

DC Field | Value | Language
dc.contributor.advisor | Cheung, DWL | -
dc.contributor.advisor | Cheng, CK | -
dc.contributor.author | Yang, Xuan | -
dc.contributor.author | 楊譞 | -
dc.date.accessioned | 2014-04-11T23:14:26Z | -
dc.date.available | 2014-04-11T23:14:26Z | -
dc.date.issued | 2013 | -
dc.identifier.citation | Yang, X. [楊譞]. (2013). Budget-limited data disambiguation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5177333 | -
dc.identifier.uri | http://hdl.handle.net/10722/196458 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.subject.lcsh | Data mining - Mathematical models | -
dc.title | Budget-limited data disambiguation | -
dc.type | PG_Thesis | -
dc.identifier.hkul | b5177333 | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.5353/th_b5177333 | -
dc.identifier.mmsid | 991036762489703414 | -
