An Efficient and Scalable Algorithm for Clustering XML Documents by Structure

Lian, W; Cheung, DWL; Mamoulis, N; Yiu, SM

Publisher Website: 10.1109/TKDE.2004.1264824
Scopus: eid_2-s2.0-0742268827
WOS: WOS:000187435500008
Appears in Collections:
- Information Technology Services: Journal/Magazine Articles

Article: An Efficient and Scalable Algorithm for Clustering XML Documents by Structure

Title	An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
Authors	Lian, W; Cheung, DWL; Mamoulis, N; Yiu, SM
Keywords	Clustering; Data mining; Query processing; Semistructured data; XML
Issue Date	2004
Publisher	IEEE. The Journal's web site is located at http://www.computer.org/tkde
Citation	IEEE Transactions on Knowledge and Data Engineering, 2004, v. 16 n. 1, p. 82-96. DOI: http://dx.doi.org/10.1109/TKDE.2004.1264824
Abstract	With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
Persistent Identifier	http://hdl.handle.net/10722/43670
ISSN	1041-4347 (2023 Impact Factor: 8.9; 2023 SCImago Journal Rankings: 2.867)
ISI Accession Number ID	WOS:000187435500008
References	References in Scopus
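
The s-graph idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that an s-graph is the set of distinct parent-child element-name edges in a document (or in a set of documents), and that the distance between two s-graphs is one minus the number of shared edges divided by the size of the larger graph. The abstract does not spell out either definition, so both are assumptions here, and the function names `s_graph`, `cluster_s_graph`, and `distance` are hypothetical. This is not the S-GRACE algorithm itself, only the kind of set-based comparison it could build on.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text: str) -> frozenset:
    """Return the distinct (parent_tag, child_tag) edges of one document."""
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            # Repeated occurrences of the same edge collapse into one entry.
            edges.add((node.tag, child.tag))
            stack.append(child)
    return frozenset(edges)


def cluster_s_graph(docs) -> frozenset:
    """s-graph of a set of documents: the union of their edge sets."""
    merged = set()
    for doc in docs:
        merged |= s_graph(doc)
    return frozenset(merged)


def distance(g1: frozenset, g2: frozenset) -> float:
    """Assumed metric: 1 - |shared edges| / max(|g1|, |g2|)."""
    larger = max(len(g1), len(g2))
    if larger == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - len(g1 & g2) / larger


# Hypothetical sample documents, invented for illustration only.
DOC_A = "<book><title/><author><name/></author></book>"
DOC_B = "<book><title/><author><name/></author><author><name/></author></book>"
DOC_C = "<article><title/><journal/></article>"
DOC_D = "<book><title/><publisher/></book>"

if __name__ == "__main__":
    print(distance(s_graph(DOC_A), s_graph(DOC_B)))  # same edge set: 0.0
    print(distance(s_graph(DOC_A), s_graph(DOC_C)))  # no shared edges: 1.0
    print(sorted(cluster_s_graph([DOC_A, DOC_D])))
```

Because each s-graph is a plain set of edges, comparing two documents or two clusters costs a set intersection rather than a tree-edit computation, which is in keeping with the abstract's description of the metric as computationally efficient.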

DC Field	Value	Language
dc.contributor.author	Lian, W	en_HK
dc.contributor.author	Cheung, DWL	en_HK
dc.contributor.author	Mamoulis, N	en_HK
dc.contributor.author	Yiu, SM	en_HK
dc.date.accessioned	2007-03-23T04:51:40Z	-
dc.date.available	2007-03-23T04:51:40Z	-
dc.date.issued	2004	en_HK
dc.identifier.citation	IEEE Transactions on Knowledge and Data Engineering, 2004, v. 16 n. 1, p. 82-96	en_HK
dc.identifier.issn	1041-4347	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/43670	-
dc.format.extent	1822344 bytes	-
dc.format.extent	26624 bytes	-
dc.format.mimetype	application/pdf	-
dc.format.mimetype	application/msword	-
dc.language	eng	en_HK
dc.publisher	IEEE. The Journal's web site is located at http://www.computer.org/tkde	en_HK
dc.relation.ispartof	IEEE Transactions on Knowledge and Data Engineering	en_HK
dc.rights	©2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.	-
dc.subject	Clustering	en_HK
dc.subject	Data mining	en_HK
dc.subject	Query processing	en_HK
dc.subject	Semistructured data	en_HK
dc.subject	XML	en_HK
dc.title	An Efficient and Scalable Algorithm for Clustering XML Documents by Structure	en_HK
dc.type	Article	en_HK
dc.identifier.openurl	http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1041-4347&volume=16&issue=1&spage=82&epage=96&date=2004&atitle=An+efficient+and+scalable+algorithm+for+clustering+XML+documents+by+structure	en_HK
dc.identifier.email	Cheung, DWL:dcheung@cs.hku.hk	en_HK
dc.identifier.email	Mamoulis, N:nikos@cs.hku.hk	en_HK
dc.identifier.email	Yiu, SM:smyiu@cs.hku.hk	en_HK
dc.identifier.authority	Cheung, DWL=rp00101	en_HK
dc.identifier.authority	Mamoulis, N=rp00155	en_HK
dc.identifier.authority	Yiu, SM=rp00207	en_HK
dc.description.nature	published_or_final_version	en_HK
dc.identifier.doi	10.1109/TKDE.2004.1264824	en_HK
dc.identifier.scopus	eid_2-s2.0-0742268827	en_HK
dc.identifier.hkuros	95426	-
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-0742268827&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	16	en_HK
dc.identifier.issue	1	en_HK
dc.identifier.spage	82	en_HK
dc.identifier.epage	96	en_HK
dc.identifier.isi	WOS:000187435500008	-
dc.publisher.place	United States	en_HK
dc.identifier.scopusauthorid	Lian, W=22433603900	en_HK
dc.identifier.scopusauthorid	Cheung, DWL=34567902600	en_HK
dc.identifier.scopusauthorid	Mamoulis, N=6701782749	en_HK
dc.identifier.scopusauthorid	Yiu, SM=7003282240	en_HK
dc.identifier.citeulike	7016801	-
dc.identifier.issnl	1041-4347	-
