Budget-limited data disambiguation

Yang, Xuan; 楊譞

File Download

FullText.pdf

Links for fulltext

(May Require Subscription)

DOI: 10.5353/th_b5177333

Supplementary

Citations:
Appears in Collections:
- Computer Science: Theses
- HKU Theses Online

postgraduate thesis: Budget-limited data disambiguation

Title	Budget-limited data disambiguation
Authors	Yang, Xuan 楊譞
Advisors	Advisor(s):Cheung, DWL Cheng, CK
Issue Date	2013
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yang, X. [楊譞]. (2013). Budget-limited data disambiguation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5177333
Abstract	The problem of data ambiguity exists in a wide range of applications. In this thesis, we study “cost-aware" methods to alleviate the data ambiguity problems in uncertain databases and social-tagging data. In database applications, ambiguous (or uncertain) data may originate from data integration and measurement error of devices. These ambiguous data are maintained by uncertain databases. In many situations, it is possible to “clean", or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In practice, a cleaning activity often involves a cost, may fail and may not remove all ambiguities. Moreover, the statistical information about how likely database entities can be cleaned may not be precisely known. We model the above aspects with the uncertain database cleaning problem, which requires us to make sensible decisions in selecting entities to clean in order to maximize the amount of ambiguous information removed under a limited budget. To solve this problem, we propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Social tagging data capture web users' textual annotations, called tags, for resources (e.g., webpages and photos). Since tags are given by casual users, they often contain noise (e.g., misspelled words) and may not be able to cover all the aspects of each resource. In this thesis, we design a metric to systematically measure the tagging quality of each resource based on the tags it has received. We propose an incentive-based tagging framework in order to improve the tagging quality. The main idea is to award users some incentive for giving (relevant) tags to resources. The challenge is, how should we allocate incentives to a large set of resources, so as to maximize the improvement of their tagging quality under a limited budget? To solve this problem, we propose a few efficient incentive allocation strategies. Experiments shows that our best strategy provides resources with a close-to-optimal gain in tagging quality. To summarize, we study the problem of budget-limited data disambiguation for uncertain databases and social tagging data \| given a set of objects (entities from uncertain databases or web resources), how can we make sensible decisions about which object to \disambiguate" (to perform a cleaning activity on the entity or ask a user to tag the resource), in order to maximize the amount of ambiguous information reduced under a limited budget.
Degree	Doctor of Philosophy
Subject	Data mining - Mathematical models
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/196458
HKU Library Item ID	b5177333

DC Field	Value	Language
dc.contributor.advisor	Cheung, DWL	-
dc.contributor.advisor	Cheng, CK	-
dc.contributor.author	Yang, Xuan	-
dc.contributor.author	楊譞	-
dc.date.accessioned	2014-04-11T23:14:26Z	-
dc.date.available	2014-04-11T23:14:26Z	-
dc.date.issued	2013	-
dc.identifier.citation	Yang, X. [楊譞]. (2013). Budget-limited data disambiguation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5177333	-
dc.identifier.uri	http://hdl.handle.net/10722/196458	-
dc.description.abstract	The problem of data ambiguity exists in a wide range of applications. In this thesis, we study “cost-aware" methods to alleviate the data ambiguity problems in uncertain databases and social-tagging data. In database applications, ambiguous (or uncertain) data may originate from data integration and measurement error of devices. These ambiguous data are maintained by uncertain databases. In many situations, it is possible to “clean", or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In practice, a cleaning activity often involves a cost, may fail and may not remove all ambiguities. Moreover, the statistical information about how likely database entities can be cleaned may not be precisely known. We model the above aspects with the uncertain database cleaning problem, which requires us to make sensible decisions in selecting entities to clean in order to maximize the amount of ambiguous information removed under a limited budget. To solve this problem, we propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Social tagging data capture web users' textual annotations, called tags, for resources (e.g., webpages and photos). Since tags are given by casual users, they often contain noise (e.g., misspelled words) and may not be able to cover all the aspects of each resource. In this thesis, we design a metric to systematically measure the tagging quality of each resource based on the tags it has received. We propose an incentive-based tagging framework in order to improve the tagging quality. The main idea is to award users some incentive for giving (relevant) tags to resources. The challenge is, how should we allocate incentives to a large set of resources, so as to maximize the improvement of their tagging quality under a limited budget? To solve this problem, we propose a few efficient incentive allocation strategies. Experiments shows that our best strategy provides resources with a close-to-optimal gain in tagging quality. To summarize, we study the problem of budget-limited data disambiguation for uncertain databases and social tagging data \| given a set of objects (entities from uncertain databases or web resources), how can we make sensible decisions about which object to \disambiguate" (to perform a cleaning activity on the entity or ask a user to tag the resource), in order to maximize the amount of ambiguous information reduced under a limited budget.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.subject.lcsh	Data mining - Mathematical models	-
dc.title	Budget-limited data disambiguation	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5177333	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5177333	-
dc.identifier.mmsid	991036762489703414	-

File Download

Links for fulltext

(May Require Subscription)

Supplementary

postgraduate thesis: Budget-limited data disambiguation

Export via OAI-PMH Interface in XML Formats

OR

Export to Other Non-XML Formats