File Download
 
Links for fulltext
(May Require Subscription)
 
Supplementary

Article: Effect of data skewness and workload balance in parallel data mining
  • Basic View
  • Metadata View
  • XML View
TitleEffect of data skewness and workload balance in parallel data mining
 
AuthorsCheung, DW2 1
Lee, SD1
Xiao, Y1
 
KeywordsAssociation rules
Data mining
Data skewness
Parallel mining
Partitioning
Workload balance
 
Issue Date2002
 
PublisherI E E E. The Journal's web site is located at http://www.computer.org/tkde
 
CitationIeee Transactions On Knowledge And Data Engineering, 2002, v. 14 n. 3, p. 498-514 [How to Cite?]
DOI: http://dx.doi.org/10.1109/TKDE.2002.1000339
 
AbstractTo mine association rules efficiently, we have developed a new parallel mining algorithm FPM on a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules. FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal. The efficiency of FPM is attributed to the incorporation of two powerful candidate sets pruning techniques: distributed and global prunings. The two techniques are sensitive to two data distribution characteristics, data skewness, and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. In order to increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms can achieve these aims very well, in particular, the results are consistently better than a random partitioning. Moreover, the partitioning algorithms incur little overhead. So, using our partitioning algorithms and FPM together, we can mine association rules from a database efficiently.
 
ISSN1041-4347
2013 Impact Factor: 1.815
2013 SCImago Journal Rankings: 1.763
 
DOIhttp://dx.doi.org/10.1109/TKDE.2002.1000339
 
ISI Accession Number IDWOS:000175317300003
 
ReferencesReferences in Scopus
 
DC FieldValue
dc.contributor.authorCheung, DW
 
dc.contributor.authorLee, SD
 
dc.contributor.authorXiao, Y
 
dc.date.accessioned2007-03-23T04:51:26Z
 
dc.date.available2007-03-23T04:51:26Z
 
dc.date.issued2002
 
dc.description.abstractTo mine association rules efficiently, we have developed a new parallel mining algorithm FPM on a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules. FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal. The efficiency of FPM is attributed to the incorporation of two powerful candidate sets pruning techniques: distributed and global prunings. The two techniques are sensitive to two data distribution characteristics, data skewness, and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. In order to increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms can achieve these aims very well, in particular, the results are consistently better than a random partitioning. Moreover, the partitioning algorithms incur little overhead. So, using our partitioning algorithms and FPM together, we can mine association rules from a database efficiently.
 
dc.description.naturepublished_or_final_version
 
dc.format.extent434493 bytes
 
dc.format.extent26624 bytes
 
dc.format.mimetypeapplication/pdf
 
dc.format.mimetypeapplication/msword
 
dc.identifier.citationIeee Transactions On Knowledge And Data Engineering, 2002, v. 14 n. 3, p. 498-514 [How to Cite?]
DOI: http://dx.doi.org/10.1109/TKDE.2002.1000339
 
dc.identifier.citeulike8355097
 
dc.identifier.doihttp://dx.doi.org/10.1109/TKDE.2002.1000339
 
dc.identifier.epage514
 
dc.identifier.hkuros70955
 
dc.identifier.isiWOS:000175317300003
 
dc.identifier.issn1041-4347
2013 Impact Factor: 1.815
2013 SCImago Journal Rankings: 1.763
 
dc.identifier.issue3
 
dc.identifier.openurl
 
dc.identifier.scopuseid_2-s2.0-0036565561
 
dc.identifier.spage498
 
dc.identifier.urihttp://hdl.handle.net/10722/43659
 
dc.identifier.volume14
 
dc.languageeng
 
dc.publisherI E E E. The Journal's web site is located at http://www.computer.org/tkde
 
dc.publisher.placeUnited States
 
dc.relation.ispartofIEEE Transactions on Knowledge and Data Engineering
 
dc.relation.referencesReferences in Scopus
 
dc.rights©2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
 
dc.rightsCreative Commons: Attribution 3.0 Hong Kong License
 
dc.subjectAssociation rules
 
dc.subjectData mining
 
dc.subjectData skewness
 
dc.subjectParallel mining
 
dc.subjectPartitioning
 
dc.subjectWorkload balance
 
dc.titleEffect of data skewness and workload balance in parallel data mining
 
dc.typeArticle
 
<?xml encoding="utf-8" version="1.0"?>
<item><contributor.author>Cheung, DW</contributor.author>
<contributor.author>Lee, SD</contributor.author>
<contributor.author>Xiao, Y</contributor.author>
<date.accessioned>2007-03-23T04:51:26Z</date.accessioned>
<date.available>2007-03-23T04:51:26Z</date.available>
<date.issued>2002</date.issued>
<identifier.citation>Ieee Transactions On Knowledge And Data Engineering, 2002, v. 14 n. 3, p. 498-514</identifier.citation>
<identifier.issn>1041-4347</identifier.issn>
<identifier.uri>http://hdl.handle.net/10722/43659</identifier.uri>
<description.abstract>To mine association rules efficiently, we have developed a new parallel mining algorithm FPM on a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules. FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal. The efficiency of FPM is attributed to the incorporation of two powerful candidate sets pruning techniques: distributed and global prunings. The two techniques are sensitive to two data distribution characteristics, data skewness, and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. In order to increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms can achieve these aims very well, in particular, the results are consistently better than a random partitioning. Moreover, the partitioning algorithms incur little overhead. So, using our partitioning algorithms and FPM together, we can mine association rules from a database efficiently.</description.abstract>
<format.extent>434493 bytes</format.extent>
<format.extent>26624 bytes</format.extent>
<format.mimetype>application/pdf</format.mimetype>
<format.mimetype>application/msword</format.mimetype>
<language>eng</language>
<publisher>I E E E. The Journal&apos;s web site is located at http://www.computer.org/tkde</publisher>
<relation.ispartof>IEEE Transactions on Knowledge and Data Engineering</relation.ispartof>
<rights>&#169;2002 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.</rights>
<rights>Creative Commons: Attribution 3.0 Hong Kong License</rights>
<subject>Association rules</subject>
<subject>Data mining</subject>
<subject>Data skewness</subject>
<subject>Parallel mining</subject>
<subject>Partitioning</subject>
<subject>Workload balance</subject>
<title>Effect of data skewness and workload balance in parallel data mining</title>
<type>Article</type>
<identifier.openurl>http://library.hku.hk:4550/resserv?sid=HKU:IR&amp;issn=1041-4347&amp;volume=14&amp;issue=3&amp;spage=498&amp;epage=514&amp;date=2002&amp;atitle=Effect+of+data+skewness+and+workload+balance+in+parallel+data+mining</identifier.openurl>
<description.nature>published_or_final_version</description.nature>
<identifier.doi>10.1109/TKDE.2002.1000339</identifier.doi>
<identifier.scopus>eid_2-s2.0-0036565561</identifier.scopus>
<identifier.hkuros>70955</identifier.hkuros>
<relation.references>http://www.scopus.com/mlt/select.url?eid=2-s2.0-0036565561&amp;selection=ref&amp;src=s&amp;origin=recordpage</relation.references>
<identifier.volume>14</identifier.volume>
<identifier.issue>3</identifier.issue>
<identifier.spage>498</identifier.spage>
<identifier.epage>514</identifier.epage>
<identifier.isi>WOS:000175317300003</identifier.isi>
<publisher.place>United States</publisher.place>
<identifier.citeulike>8355097</identifier.citeulike>
<bitstream.url>http://hub.hku.hk/bitstream/10722/43659/1/70955.pdf</bitstream.url>
</item>
Author Affiliations
  1. The University of Hong Kong
  2. IEEE