Cleaning uncertain data with quality guarantees

Cheng, CK; Chen, J; Xie, X

Links for fulltext

(May Require Subscription)

Publisher Website: 10.14778/1453856.1453935
Scopus: eid_2-s2.0-61349087255
WOS: WOS:000219595600060

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Information Technology Services: Journal/Magazine Articles

Article: Cleaning uncertain data with quality guarantees

Title	Cleaning uncertain data with quality guarantees
Authors	Cheng, CK; Chen, J; Xie, X
Issue Date	2008
Publisher	Very Large Data Base (VLDB) Endowment Inc. The journal's web site is located at http://vldb.org/pvldb/index.html
Citation	Proceedings of the VLDB Endowment, 2008, v. 1, p. 722-735. DOI: http://dx.doi.org/10.14778/1453856.1453935
Abstract	Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To our best knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.
Persistent Identifier	http://hdl.handle.net/10722/61148
ISSN	2150-8097 (2023 Impact Factor: 2.6; 2023 SCImago Journal Rankings: 2.666)
ISI Accession Number ID	WOS:000219595600060

DC Field	Value	Language
dc.contributor.author	Cheng, CK	en_HK
dc.contributor.author	Chen, J	-
dc.contributor.author	Xie, X	-
dc.date.accessioned	2010-07-13T03:31:57Z	-
dc.date.available	2010-07-13T03:31:57Z	-
dc.date.issued	2008	en_HK
dc.identifier.citation	Proceedings of the VLDB Endowment, 2008, v. 1, p. 722-735	-
dc.identifier.issn	2150-8097	-
dc.identifier.uri	http://hdl.handle.net/10722/61148	-
dc.description.abstract	Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To our best knowledge, this is the first work that develops a quality metric for a probabilistic database, and investigates how such a metric can be used for data cleaning purposes.	-
dc.language	eng	en_HK
dc.publisher	Very Large Data Base (VLDB) Endowment Inc. The journal's web site is located at http://vldb.org/pvldb/index.html	-
dc.relation.ispartof	Proceedings of the VLDB Endowment	-
dc.title	Cleaning uncertain data with quality guarantees	en_HK
dc.type	Article	en_HK
dc.identifier.email	Cheng, CK: ckcheng@cs.hku.hk	en_HK
dc.identifier.authority	Cheng, CK=rp00074	en_HK
dc.description.nature	link_to_OA_fulltext	-
dc.identifier.doi	10.14778/1453856.1453935	-
dc.identifier.scopus	eid_2-s2.0-61349087255	-
dc.identifier.hkuros	150618	en_HK
dc.identifier.volume	1	-
dc.identifier.spage	722	-
dc.identifier.epage	735	-
dc.identifier.isi	WOS:000219595600060	-
dc.identifier.issnl	2150-8097	-
