Appears in Collections: Postgraduate thesis: Managing the quality of crowdsourced databases
Title | Managing the quality of crowdsourced databases |
---|---|
Authors | Zheng, Yudian (鄭玉典) |
Advisors | Cheng, CK |
Issue Date | 2017 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Zheng, Y. [鄭玉典]. (2017). Managing the quality of crowdsourced databases. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Many important data management and analytics tasks cannot be completely addressed by automated processes. For example, entity resolution, sentiment analysis, and image recognition can be enhanced through the use of human input. Crowdsourcing platforms are an effective way to harness the capabilities of the crowd to apply human computation to such tasks. In recent years, crowdsourced data management has become an area of increasing interest in research and industry.
Typical crowd workers vary widely in expertise, background, and quality. As such, a crowdsourced database, which collects information from these workers, may be highly noisy and inaccurate. It is therefore of the utmost importance to manage the quality of crowdsourced databases. In this thesis, we identify and address two fundamental problems in crowdsourced quality management: (1) Task Assignment, which selects suitable tasks and assigns them to appropriate crowd workers; and (2) Truth Inference, which aggregates the answers obtained from crowd workers to infer the final result.
For the task assignment problem, we consider two common settings adopted in existing crowdsourcing solutions: task-based and worker-based. In the task-based setting, given a pool of n tasks, we are interested in which k tasks should be assigned to a worker. A poor assignment may not only waste time and money but may also hurt the quality of a crowdsourcing application that depends on the workers' answers. We propose to consider evaluation metrics (e.g., Accuracy and F-score) that are relevant to an application, and we explore how to optimally assign tasks in an online manner. In the worker-based setting, given a monetary budget and a set of workers, we study how workers should be selected so that the tasks at hand can be accomplished successfully and economically. We observe that this is related to the aggregation of workers' qualities, and we propose a solution that optimally aggregates the qualities of different workers, which is fundamental to selecting workers.
For the truth inference problem, although extensive solutions exist, we find that they have not been compared under the same framework, making it hard for practitioners to select appropriate ones. We conduct a detailed survey of 17 existing solutions and provide an in-depth analysis from various perspectives.
Finally, we integrate task assignment and truth inference in a unified framework and apply them to two crowdsourcing applications: image tagging and question answering. For image tagging, where a worker is asked to select the correct label(s) for a task from multiple given choices, we identify workers' unique characteristics in answering multi-label tasks and study how these characteristics can help solve the two problems. For question answering, workers may have diverse qualities across different domains; for example, a worker who is a basketball fan should be better at labeling a photo related to 'Stephen Curry' than one related to 'Leonardo DiCaprio'. We leverage domain knowledge to accurately model a worker's quality and apply the model to address the two problems. |
Degree | Doctor of Philosophy |
Subject | Database management; Human computation |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/250761 |
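The truth-inference step described in the abstract — aggregating noisy worker answers while estimating each worker's quality — is commonly realized by EM-style iteration, as in the Dawid-Skene family of models. The following is a minimal illustrative sketch for binary tasks, not the thesis's actual algorithm; all names (`infer_truths`, `quality`, the smoothing constants) are invented for this example.

```python
# Minimal EM-style truth inference for binary labeling tasks:
# alternate between (1) inferring each task's truth by quality-weighted
# voting and (2) re-estimating each worker's quality from agreement
# with the inferred truths.

from collections import defaultdict

def infer_truths(answers, iterations=10):
    """answers: list of (worker_id, task_id, label) tuples, label in {0, 1}."""
    workers = {w for w, _, _ in answers}
    tasks = {t for _, t, _ in answers}
    quality = {w: 0.8 for w in workers}  # initial guess: workers mostly reliable
    truth = {}
    for _ in range(iterations):
        # E-step: quality-weighted vote per task
        for t in tasks:
            score = 0.0
            for w, t2, label in answers:
                if t2 == t:
                    score += quality[w] if label == 1 else -quality[w]
            truth[t] = 1 if score >= 0 else 0
        # M-step: quality = fraction of a worker's answers matching inferred truth
        correct, total = defaultdict(int), defaultdict(int)
        for w, t, label in answers:
            total[w] += 1
            if label == truth[t]:
                correct[w] += 1
        for w in workers:
            quality[w] = (correct[w] + 1) / (total[w] + 2)  # Laplace smoothing
    return truth, quality
```

With two reliable workers outvoting one unreliable worker, the iteration converges to the majority labels while assigning the dissenting worker a lower quality estimate, which is the basic mechanism that richer models (confusion matrices, per-domain qualities) build on.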
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Cheng, CK | - |
dc.contributor.author | Zheng, Yudian | - |
dc.contributor.author | 鄭玉典 | - |
dc.date.accessioned | 2018-01-26T01:59:28Z | - |
dc.date.available | 2018-01-26T01:59:28Z | - |
dc.date.issued | 2017 | - |
dc.identifier.citation | Zheng, Y. [鄭玉典]. (2017). Managing the quality of crowdsourced databases. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/250761 | - |
dc.description.abstract | (abstract reproduced above) | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Database management | - |
dc.subject.lcsh | Human computation | - |
dc.title | Managing the quality of crowdsourced databases | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.5353/th_991043982880403414 | - |
dc.date.hkucongregation | 2017 | - |
dc.identifier.mmsid | 991043982880403414 | - |