Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1109/TKDE.2004.1264824
- Scopus: eid_2-s2.0-0742268827
- WOS: WOS:000187435500008
Article: An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
Title | An Efficient and Scalable Algorithm for Clustering XML Documents by Structure |
---|---|
Authors | Lian, W; Cheung, DWL; Mamoulis, N; Yiu, SM |
Keywords | Clustering Data mining Query processing Semistructured data XML |
Issue Date | 2004 |
Publisher | IEEE. The Journal's web site is located at http://www.computer.org/tkde |
Citation | IEEE Transactions on Knowledge and Data Engineering, 2004, v. 16 n. 1, p. 82-96 |
Abstract | With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection. |
Persistent Identifier | http://hdl.handle.net/10722/43670 |
ISSN | 1041-4347 (2023 Impact Factor: 8.9; 2023 SCImago Journal Rankings: 2.867) |
ISI Accession Number ID | WOS:000187435500008 |
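The abstract describes a structure graph (s-graph) and a distance metric defined between documents and sets of documents, but it does not give the definitions. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an s-graph is the set of parent-child element-name edges found in a document, that the s-graph of a set of documents is the union of their edges, and that distance is one minus the shared-edge count divided by the size of the larger edge set. The function names `s_graph`, `s_graph_of_set`, and `distance` are hypothetical.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Return the set of (parent_tag, child_tag) edges in one XML document.

    Assumption: repeated elements and text content are ignored, so only
    the structural summary of the document remains.
    """
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return frozenset(edges)


def s_graph_of_set(xml_texts):
    """Assumed s-graph of a set of documents: the union of their edge sets."""
    merged = set()
    for text in xml_texts:
        merged |= s_graph(text)
    return frozenset(merged)


def distance(g1, g2):
    """Assumed metric: 1 - |shared edges| / max(|g1|, |g2|)."""
    if not g1 and not g2:
        return 0.0  # two edgeless documents are treated as identical
    return 1.0 - len(g1 & g2) / max(len(g1), len(g2))


doc_a = "<a><b><c/></b><b><c/></b></a>"  # edges: (a,b), (b,c)
doc_b = "<a><b/><d/></a>"                # edges: (a,b), (a,d)
doc_c = "<a><b><c/></b></a>"             # same structure as doc_a

print(distance(s_graph(doc_a), s_graph(doc_b)))
print(distance(s_graph(doc_a), s_graph(doc_c)))
print(distance(s_graph_of_set([doc_a, doc_c]), s_graph(doc_b)))
```

Because each document reduces to a small edge set, comparing two documents or two clusters under these assumptions is a set intersection rather than a tree-edit computation, which is in keeping with the efficiency claim in the abstract.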
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Lian, W | en_HK |
dc.contributor.author | Cheung, DWL | en_HK |
dc.contributor.author | Mamoulis, N | en_HK |
dc.contributor.author | Yiu, SM | en_HK |
dc.date.accessioned | 2007-03-23T04:51:40Z | - |
dc.date.available | 2007-03-23T04:51:40Z | - |
dc.date.issued | 2004 | en_HK |
dc.identifier.citation | IEEE Transactions on Knowledge and Data Engineering, 2004, v. 16 n. 1, p. 82-96 | en_HK |
dc.identifier.issn | 1041-4347 | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/43670 | - |
dc.description.abstract | With the standardization of XML as an information exchange language over the net, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection. | en_HK |
dc.format.extent | 1822344 bytes | - |
dc.format.extent | 26624 bytes | - |
dc.format.mimetype | application/pdf | - |
dc.format.mimetype | application/msword | - |
dc.language | eng | en_HK |
dc.publisher | IEEE. The Journal's web site is located at http://www.computer.org/tkde | en_HK |
dc.relation.ispartof | IEEE Transactions on Knowledge and Data Engineering | en_HK |
dc.rights | ©2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. | - |
dc.subject | Clustering | en_HK |
dc.subject | Data mining | en_HK |
dc.subject | Query processing | en_HK |
dc.subject | Semistructured data | en_HK |
dc.subject | XML | en_HK |
dc.title | An Efficient and Scalable Algorithm for Clustering XML Documents by Structure | en_HK |
dc.type | Article | en_HK |
dc.identifier.openurl | http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=1041-4347&volume=16&issue=1&spage=82&epage=96&date=2004&atitle=An+efficient+and+scalable+algorithm+for+clustering+XML+documents+by+structure | en_HK |
dc.identifier.email | Cheung, DWL:dcheung@cs.hku.hk | en_HK |
dc.identifier.email | Mamoulis, N:nikos@cs.hku.hk | en_HK |
dc.identifier.email | Yiu, SM:smyiu@cs.hku.hk | en_HK |
dc.identifier.authority | Cheung, DWL=rp00101 | en_HK |
dc.identifier.authority | Mamoulis, N=rp00155 | en_HK |
dc.identifier.authority | Yiu, SM=rp00207 | en_HK |
dc.description.nature | published_or_final_version | en_HK |
dc.identifier.doi | 10.1109/TKDE.2004.1264824 | en_HK |
dc.identifier.scopus | eid_2-s2.0-0742268827 | en_HK |
dc.identifier.hkuros | 95426 | - |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-0742268827&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.volume | 16 | en_HK |
dc.identifier.issue | 1 | en_HK |
dc.identifier.spage | 82 | en_HK |
dc.identifier.epage | 96 | en_HK |
dc.identifier.isi | WOS:000187435500008 | - |
dc.publisher.place | United States | en_HK |
dc.identifier.scopusauthorid | Lian, W=22433603900 | en_HK |
dc.identifier.scopusauthorid | Cheung, DWL=34567902600 | en_HK |
dc.identifier.scopusauthorid | Mamoulis, N=6701782749 | en_HK |
dc.identifier.scopusauthorid | Yiu, SM=7003282240 | en_HK |
dc.identifier.citeulike | 7016801 | - |
dc.identifier.issnl | 1041-4347 | - |