File Download
Supplementary

postgraduate thesis: Cleaning algorithms for novel applications

TitleCleaning algorithms for novel applications
Authors
Issue Date2015
PublisherThe University of Hong Kong (Pokfulam, Hong Kong)
Citation
Mo, L. [莫璐怡]. (2015). Cleaning algorithms for novel applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5699933
AbstractThe information managed in emerging applications, such as location-based service, sensor network, and crowdsourcing system, is usually imperfect. In many situations, data can be cleaned (e.g., removed or reduced) by performing appropriate operations. In this thesis, we study the cleaning problem under limited resources for two novel applications: querying probabilistic data, and collecting data from human intelligence tasks in crowdsourcing environments. Probabilistic databases have been developed to handle uncertain data recently. For example, the temperature readings in a sensor network may be uncertain due to the lack of latest readings from sensors at every moment. A probabilistic database is able to capture the real value distributions of the readings, and enables evaluation of probabilistic queries on the data. However, data uncertainty may lead to ambiguous query results. By performing cleaning operations on the data, for example, probing some sensors for their latest readings, the ambiguity in query results can be reduced. In this thesis, we first study how to quantify the ambiguity of query results returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database in order to improve top-k query quality. Specifically, we consider the facts that cleaning may involve a cost and fail. We propose optimal cleaning algorithms as well as several heuristics to select the data to clean under a limited budget. In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified to collect data. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may be casual Internet users, their answers are hardly perfect. If more workers are employed to perform a HIT, the quality of the HIT’s answer could be statistically improved. Hence, assigning the number of workers (or plurality) of each HIT is an effective way to reduce (or clean) the imperfectness of the collected data (i.e., HITs answers). In this thesis, we address the important problem of determining the plurality of each HIT so that the overall answer quality is optimized. We propose a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP). We identify two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT’s answer increases monotonically at a decreasing rate with its plurality. We show for HITs that satisfy the two properties (e.g., multiple-choice-question HITs), the PAP is approximable. We propose an efficient greedy algorithm for such case.
DegreeDoctor of Philosophy
SubjectDatabase management
Data mining
Dept/ProgramComputer Science
Persistent Identifierhttp://hdl.handle.net/10722/223008

 

DC FieldValueLanguage
dc.contributor.authorMo, Luyi-
dc.contributor.author莫璐怡-
dc.date.accessioned2016-02-17T23:14:30Z-
dc.date.available2016-02-17T23:14:30Z-
dc.date.issued2015-
dc.identifier.citationMo, L. [莫璐怡]. (2015). Cleaning algorithms for novel applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5699933-
dc.identifier.urihttp://hdl.handle.net/10722/223008-
dc.description.abstractThe information managed in emerging applications, such as location-based service, sensor network, and crowdsourcing system, is usually imperfect. In many situations, data can be cleaned (e.g., removed or reduced) by performing appropriate operations. In this thesis, we study the cleaning problem under limited resources for two novel applications: querying probabilistic data, and collecting data from human intelligence tasks in crowdsourcing environments. Probabilistic databases have been developed to handle uncertain data recently. For example, the temperature readings in a sensor network may be uncertain due to the lack of latest readings from sensors at every moment. A probabilistic database is able to capture the real value distributions of the readings, and enables evaluation of probabilistic queries on the data. However, data uncertainty may lead to ambiguous query results. By performing cleaning operations on the data, for example, probing some sensors for their latest readings, the ambiguity in query results can be reduced. In this thesis, we first study how to quantify the ambiguity of query results returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database in order to improve top-k query quality. Specifically, we consider the facts that cleaning may involve a cost and fail. We propose optimal cleaning algorithms as well as several heuristics to select the data to clean under a limited budget. In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified to collect data. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may be casual Internet users, their answers are hardly perfect. If more workers are employed to perform a HIT, the quality of the HIT’s answer could be statistically improved. Hence, assigning the number of workers (or plurality) of each HIT is an effective way to reduce (or clean) the imperfectness of the collected data (i.e., HITs answers). In this thesis, we address the important problem of determining the plurality of each HIT so that the overall answer quality is optimized. We propose a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP). We identify two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT’s answer increases monotonically at a decreasing rate with its plurality. We show for HITs that satisfy the two properties (e.g., multiple-choice-question HITs), the PAP is approximable. We propose an efficient greedy algorithm for such case.-
dc.languageeng-
dc.publisherThe University of Hong Kong (Pokfulam, Hong Kong)-
dc.relation.ispartofHKU Theses Online (HKUTO)-
dc.rightsCreative Commons: Attribution 3.0 Hong Kong License-
dc.rightsThe author retains all proprietary rights, (such as patent rights) and the right to use in future works.-
dc.subject.lcshDatabase management-
dc.subject.lcshData mining-
dc.titleCleaning algorithms for novel applications-
dc.typePG_Thesis-
dc.identifier.hkulb5699933-
dc.description.thesisnameDoctor of Philosophy-
dc.description.thesislevelDoctoral-
dc.description.thesisdisciplineComputer Science-
dc.description.naturepublished_or_final_version-

Export via OAI-PMH Interface in XML Formats


OR


Export to Other Non-XML Formats