Effect of data skewness and workload balance in parallel data mining

Cheung, DW; Lee, SD; Xiao, Y

Publisher Website: 10.1109/TKDE.2002.1000339
Scopus: eid_2-s2.0-0036565561
WOS: WOS:000175317300003
Bookmarks:
- CiteULike: 1
Citations:
- Scopus: 44
- Web of Science: 0
Appears in Collections:
- Information Technology Services: Journal/Magazine Articles

Article: Effect of data skewness and workload balance in parallel data mining

Title	Effect of data skewness and workload balance in parallel data mining
Authors	Cheung, DW; Lee, SD; Xiao, Y
Keywords	Association rules; Data mining; Data skewness; Parallel mining; Partitioning; Workload balance
Issue Date	2002
Publisher	IEEE. The Journal's web site is located at http://www.computer.org/tkde
Citation	IEEE Transactions on Knowledge and Data Engineering, 2002, v. 14 n. 3, p. 498-514. DOI: http://dx.doi.org/10.1109/TKDE.2002.1000339
Abstract	To mine association rules efficiently, we have developed a new parallel mining algorithm, FPM, on a distributed shared-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules. FPM requires fewer rounds of message exchange than FDM and hence has a better response time in a parallel environment. The algorithm has been found experimentally to outperform CD, a representative parallel algorithm for the same goal. The efficiency of FPM is attributed to the incorporation of two powerful candidate-set pruning techniques: distributed pruning and global pruning. The two techniques are sensitive to two data distribution characteristics: data skewness and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. To increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown that our partitioning algorithms achieve these aims very well; in particular, the results are consistently better than those of random partitioning. Moreover, the partitioning algorithms incur little overhead. Thus, using our partitioning algorithms together with FPM, we can mine association rules from a database efficiently.
Persistent Identifier	http://hdl.handle.net/10722/43659
ISSN	1041-4347; 2023 Impact Factor: 8.9; 2023 SCImago Journal Rankings: 2.867
ISI Accession Number ID	WOS:000175317300003
References	References in Scopus
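
The abstract states that the skewness and balance metrics are based on entropy but does not give their formulas. The following is only a minimal sketch of one plausible entropy-based formulation, not the paper's definitions: it assumes skewness is one minus the normalized entropy of an itemset's support counts across partitions, and balance is the normalized entropy of the per-partition workloads. The function names and the normalization by log(n) are assumptions made for illustration.

```python
import math
from typing import Sequence


def normalized_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy of the distribution given by `counts`, divided by log(n).

    The result lies in [0, 1]: 1 when the counts are spread evenly over the
    n partitions, 0 when they are concentrated in a single partition.
    """
    n = len(counts)
    total = sum(counts)
    if n < 2 or total <= 0:
        raise ValueError("need at least two partitions and a positive total")
    entropy = 0.0
    for c in counts:
        if c > 0:  # terms with zero probability contribute nothing
            p = c / total
            entropy -= p * math.log(p)
    return entropy / math.log(n)


def skewness(support_counts: Sequence[float]) -> float:
    """Hypothetical skewness of one itemset: high when its support is
    concentrated in few partitions (assumed form: 1 - normalized entropy)."""
    return 1.0 - normalized_entropy(support_counts)


def balance(workloads: Sequence[float]) -> float:
    """Hypothetical workload balance: high when every partition carries a
    similar workload (assumed form: normalized entropy of the workloads)."""
    return normalized_entropy(workloads)


if __name__ == "__main__":
    # Support counts of one itemset in four partitions (made-up numbers).
    print(skewness([40, 0, 0, 0]))     # concentrated in one partition
    print(skewness([10, 10, 10, 10]))  # spread evenly
    print(balance([25, 25, 25, 25]))   # equal workloads
```

Under these assumed definitions, an itemset found in only one partition has maximal skewness, while partitions with equal workloads have maximal balance, which matches the abstract's statement that pruning works best when both quantities are high.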

DC Field	Value	Language
dc.contributor.author	Cheung, DW	en_HK
dc.contributor.author	Lee, SD	en_HK
dc.contributor.author	Xiao, Y	en_HK
dc.date.accessioned	2007-03-23T04:51:26Z	-
dc.date.available	2007-03-23T04:51:26Z	-
dc.date.issued	2002	en_HK
dc.identifier.citation	IEEE Transactions on Knowledge and Data Engineering, 2002, v. 14 n. 3, p. 498-514	en_HK
dc.identifier.issn	1041-4347	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/43659	-
dc.description.abstract	To mine association rules efficiently, we have developed a new parallel mining algorithm, FPM, on a distributed shared-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules. FPM requires fewer rounds of message exchange than FDM and hence has a better response time in a parallel environment. The algorithm has been found experimentally to outperform CD, a representative parallel algorithm for the same goal. The efficiency of FPM is attributed to the incorporation of two powerful candidate-set pruning techniques: distributed pruning and global pruning. The two techniques are sensitive to two data distribution characteristics: data skewness and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. To increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown that our partitioning algorithms achieve these aims very well; in particular, the results are consistently better than those of random partitioning. Moreover, the partitioning algorithms incur little overhead. Thus, using our partitioning algorithms together with FPM, we can mine association rules from a database efficiently.	en_HK
dc.format.extent	434493 bytes	-
dc.format.extent	26624 bytes	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/msword	-
dc.language	eng	en_HK
dc.publisher	IEEE. The Journal's web site is located at http://www.computer.org/tkde	en_HK
dc.relation.ispartof	IEEE Transactions on Knowledge and Data Engineering	en_HK
dc.rights	©2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.	-
dc.subject	Association rules	en_HK
dc.subject	Data mining	en_HK
dc.subject	Data skewness	en_HK
dc.subject	Parallel mining	en_HK
dc.subject	Partitioning	en_HK
dc.subject	Workload balance	en_HK
dc.title	Effect of data skewness and workload balance in parallel data mining	en_HK
dc.type	Article	en_HK
dc.identifier.openurl	http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1041-4347&volume=14&issue=3&spage=498&epage=514&date=2002&atitle=Effect+of+data+skewness+and+workload+balance+in+parallel+data+mining	en_HK
dc.identifier.email	Cheung, DW:dcheung@cs.hku.hk	en_HK
dc.identifier.authority	Cheung, DW=rp00101	en_HK
dc.description.nature	published_or_final_version	en_HK
dc.identifier.doi	10.1109/TKDE.2002.1000339	en_HK
dc.identifier.scopus	eid_2-s2.0-0036565561	en_HK
dc.identifier.hkuros	70955	-
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-0036565561&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	14	en_HK
dc.identifier.issue	3	en_HK
dc.identifier.spage	498	en_HK
dc.identifier.epage	514	en_HK
dc.identifier.isi	WOS:000175317300003	-
dc.publisher.place	United States	en_HK
dc.identifier.scopusauthorid	Cheung, DW=34567902600	en_HK
dc.identifier.scopusauthorid	Lee, SD=37056848600	en_HK
dc.identifier.scopusauthorid	Xiao, Y=22735880100	en_HK
dc.identifier.citeulike	8355097	-
dc.identifier.issnl	1041-4347	-
