Article: An Efficient and Scalable Algorithm for Clustering XML Documents by Structure

Title: An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
Authors: Lian, W; Cheung, DWL; Mamoulis, N; Yiu, SM
Keywords: Clustering; Data mining; Query processing; Semistructured data; XML
Issue Date: 2004
Publisher: IEEE. The journal's web site is located at http://www.computer.org/tkde
Citation: IEEE Transactions on Knowledge and Data Engineering, 2004, v. 16 n. 1, p. 82-96
Abstract: With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
Persistent Identifier: http://hdl.handle.net/10722/43670
ISSN: 1041-4347
2023 Impact Factor: 8.9
2023 SCImago Journal Rankings: 2.867
ISI Accession Number ID: WOS:000187435500008

DC Field | Value | Language
dc.contributor.author | Lian, W | en_HK
dc.contributor.author | Cheung, DWL | en_HK
dc.contributor.author | Mamoulis, N | en_HK
dc.contributor.author | Yiu, SM | en_HK
dc.date.accessioned | 2007-03-23T04:51:40Z | -
dc.date.available | 2007-03-23T04:51:40Z | -
dc.date.issued | 2004 | en_HK
dc.identifier.citation | Ieee Transactions On Knowledge And Data Engineering, 2004, v. 16 n. 1, p. 82-96 | en_HK
dc.identifier.issn | 1041-4347 | en_HK
dc.identifier.uri | http://hdl.handle.net/10722/43670 | -
dc.description.abstract | With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection. | en_HK
dc.format.extent | 1822344 bytes | -
dc.format.extent | 26624 bytes | -
dc.format.mimetype | application/pdf | -
dc.format.mimetype | application/msword | -
dc.language | eng | en_HK
dc.publisher | I E E E. The Journal's web site is located at http://www.computer.org/tkde | en_HK
dc.relation.ispartof | IEEE Transactions on Knowledge and Data Engineering | en_HK
dc.rights | ©2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. | -
dc.subject | Clustering | en_HK
dc.subject | Data mining | en_HK
dc.subject | Query processing | en_HK
dc.subject | Semistructured data | en_HK
dc.subject | XML | en_HK
dc.title | An Efficient and Scalable Algorithm for Clustering XML Documents by Structure | en_HK
dc.type | Article | en_HK
dc.identifier.openurl | http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1041-4347&volume=16&issue=1&spage=82&epage=96&date=2004&atitle=An+efficient+and+scalable+algorithm+for+clustering+XML+documents+by+structure | en_HK
dc.identifier.email | Cheung, DWL:dcheung@cs.hku.hk | en_HK
dc.identifier.email | Mamoulis, N:nikos@cs.hku.hk | en_HK
dc.identifier.email | Yiu, SM:smyiu@cs.hku.hk | en_HK
dc.identifier.authority | Cheung, DWL=rp00101 | en_HK
dc.identifier.authority | Mamoulis, N=rp00155 | en_HK
dc.identifier.authority | Yiu, SM=rp00207 | en_HK
dc.description.nature | published_or_final_version | en_HK
dc.identifier.doi | 10.1109/TKDE.2004.1264824 | en_HK
dc.identifier.scopus | eid_2-s2.0-0742268827 | en_HK
dc.identifier.hkuros | 95426 | -
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-0742268827&selection=ref&src=s&origin=recordpage | en_HK
dc.identifier.volume | 16 | en_HK
dc.identifier.issue | 1 | en_HK
dc.identifier.spage | 82 | en_HK
dc.identifier.epage | 96 | en_HK
dc.identifier.isi | WOS:000187435500008 | -
dc.publisher.place | United States | en_HK
dc.identifier.scopusauthorid | Lian, W=22433603900 | en_HK
dc.identifier.scopusauthorid | Cheung, DWL=34567902600 | en_HK
dc.identifier.scopusauthorid | Mamoulis, N=6701782749 | en_HK
dc.identifier.scopusauthorid | Yiu, SM=7003282240 | en_HK
dc.identifier.citeulike | 7016801 | -
dc.identifier.issnl | 1041-4347 | -
