Appears in Collections: Postgraduate thesis: Managing the quality of crowdsourced databases
Title | Managing the quality of crowdsourced databases |
---|---|
Authors | Zheng, Yudian (鄭玉典) |
Advisors | Cheng, CK |
Issue Date | 2017 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zheng, Y. [鄭玉典]. (2017). Managing the quality of crowdsourced databases. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Many important data management and analytics tasks cannot be completely addressed by automated processes. For example, entity resolution, sentiment analysis, and image recognition can be enhanced through the use of human input. Crowdsourcing platforms are an effective way to harness the capabilities of the crowd to apply human computation to such tasks. In recent years, crowdsourced data management has become an area of increasing interest in research and industry.
Typical crowd workers vary widely in expertise, background, and quality. As such, a crowdsourced database, which collects information from these workers, may be highly noisy and inaccurate. It is therefore of the utmost importance to manage the quality of crowdsourced databases. In this thesis, we identify and address two fundamental problems in crowdsourced quality management: (1) Task Assignment, which selects suitable tasks and assigns them to appropriate crowd workers; and (2) Truth Inference, which aggregates the answers obtained from crowd workers to infer the final result.
For the task assignment problem, we consider two common settings adopted in existing crowdsourcing solutions: task-based and worker-based. In the task-based setting, given a pool of n tasks, we are interested in which k tasks should be assigned to a worker. A poor assignment may not only waste time and money but may also hurt the quality of a crowdsourcing application that depends on the workers' answers. We propose to consider evaluation metrics (e.g., Accuracy and F-score) that are relevant to an application, and we explore how to optimally assign tasks in an online manner. In the worker-based setting, given a monetary budget and a set of workers, we study how workers should be selected so that the tasks at hand can be accomplished successfully and economically. We observe that this is related to the aggregation of workers' qualities, and we propose a solution that optimally aggregates the qualities of different workers, which is fundamental to selecting workers.
For the truth inference problem, although extensive solutions exist, we find that they have not been compared under the same framework, making it hard for practitioners to select appropriate ones. We conduct a detailed survey of 17 existing solutions and provide an in-depth analysis from various perspectives.
Finally, we integrate task assignment and truth inference in a unified framework and apply them to two crowdsourcing applications: image tagging and question answering. For image tagging, where a worker is asked to select the correct label(s) for a task from multiple given choices, we identify workers' unique characteristics in answering multi-label tasks and study how these characteristics can help solve the two problems. For question answering, workers may have diverse qualities across different domains; for example, a worker who is a basketball fan should be better at labeling a photo related to 'Stephen Curry' than one related to 'Leonardo DiCaprio'. We leverage domain knowledge to accurately model a worker's quality and apply the model to address the two problems. |
Degree | Doctor of Philosophy |
Subject | Database management; Human computation |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/250761 |
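The truth-inference step described in the abstract — aggregating noisy worker answers while estimating each worker's quality — is commonly realized by EM-style iteration, as in the Dawid-Skene family of models. The following is a minimal illustrative sketch for binary tasks, not the thesis's actual algorithm; all names (`infer_truths`, `quality`, the smoothing constants) are invented for this example.

```python
# Minimal EM-style truth inference for binary labeling tasks:
# alternate between (1) inferring each task's truth by quality-weighted
# voting and (2) re-estimating each worker's quality from agreement
# with the inferred truths.

from collections import defaultdict

def infer_truths(answers, iterations=10):
    """answers: list of (worker_id, task_id, label) tuples, label in {0, 1}."""
    workers = {w for w, _, _ in answers}
    tasks = {t for _, t, _ in answers}
    quality = {w: 0.8 for w in workers}  # initial guess: workers mostly reliable
    truth = {}
    for _ in range(iterations):
        # E-step: quality-weighted vote per task
        for t in tasks:
            score = 0.0
            for w, t2, label in answers:
                if t2 == t:
                    score += quality[w] if label == 1 else -quality[w]
            truth[t] = 1 if score >= 0 else 0
        # M-step: quality = fraction of a worker's answers matching inferred truth
        correct, total = defaultdict(int), defaultdict(int)
        for w, t, label in answers:
            total[w] += 1
            if label == truth[t]:
                correct[w] += 1
        for w in workers:
            quality[w] = (correct[w] + 1) / (total[w] + 2)  # Laplace smoothing
    return truth, quality
```

With two reliable workers outvoting one unreliable worker, the iteration converges to the majority labels while assigning the dissenting worker a lower quality estimate, which is the basic mechanism that richer models (confusion matrices, per-domain qualities) build on.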
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Cheng, CK | - |
dc.contributor.author | Zheng, Yudian | - |
dc.contributor.author | 鄭玉典 | - |
dc.date.accessioned | 2018-01-26T01:59:28Z | - |
dc.date.available | 2018-01-26T01:59:28Z | - |
dc.date.issued | 2017 | - |
dc.identifier.citation | Zheng, Y. [鄭玉典]. (2017). Managing the quality of crowdsourced databases. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/250761 | - |
dc.description.abstract | (abstract reproduced above) | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Database management | - |
dc.subject.lcsh | Human computation | - |
dc.title | Managing the quality of crowdsourced databases | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991043982880403414 | - |
dc.date.hkucongregation | 2017 | - |
dc.identifier.mmsid | 991043982880403414 | - |