STEM: a suffix tree-based method for web data records extraction

Fang, Y; Xie, X; Zhang, X; Cheng, CK; Zhang, Z

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1007/s10115-017-1062-0
Scopus: eid_2-s2.0-85019121195
WOS: WOS:000427971400002

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Computer Science: Journal/Magazine Articles

Title	STEM: a suffix tree-based method for web data records extraction
Authors	Fang, Y; Xie, X; Zhang, X; Cheng, CK; Zhang, Z
Keywords	Web data extraction; Suffix tree; HTML tag path; Data Record pattern
Issue Date	2018
Publisher	Springer-Verlag London Ltd. The Journal's web site is located at http://link.springer.de/link/service/journals/10115/
Citation	Knowledge and Information Systems, 2018, v. 55 n. 2, p. 305-331. DOI: http://dx.doi.org/10.1007/s10115-017-1062-0
Abstract	To extract data records from Web pages automatically, an extraction algorithm must be both robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence, and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The experimental results demonstrate that STEM is superior to state-of-the-art algorithms such as MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in the Web pages, which indicates the potential applicability of STEM to a wide range of Web-scale data record extraction applications.
Persistent Identifier	http://hdl.handle.net/10722/243522
ISSN	0219-1377
2023 Impact Factor	2.5
2023 SCImago Journal Rankings	0.860
ISI Accession Number ID	WOS:000427971400002
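
The two steps named in the abstract (turning tag paths into a sequence of identifiers, then indexing that sequence to find repeated patterns) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a naive suffix trie with quadratic construction cost in place of the linear-time suffix tree the paper describes, and a single hypothetical "back-to-back repeats" check in place of the four refining filters, which the abstract does not specify. All function and class names are invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a small assumed subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.paths = []   # one tag path per start tag

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_ids(html):
    """Map each distinct tag path to a small integer identifier."""
    collector = TagPathCollector()
    collector.feed(html)
    ids = {}
    seq = [ids.setdefault(path, len(ids)) for path in collector.paths]
    return seq, ids


def build_suffix_trie(seq):
    """Naive suffix trie (O(n^2)); each node keeps the start positions of its substring."""
    root = {"children": {}, "starts": []}
    for i in range(len(seq)):
        node = root
        for sym in seq[i:]:
            node = node["children"].setdefault(sym, {"children": {}, "starts": []})
            node["starts"].append(i)
    return root


def repeated_patterns(seq, min_repeats=2, min_len=2):
    """Return (pattern, starts) for substrings whose occurrences are back to back.

    The adjacency test is a hypothetical stand-in for the paper's refining filters.
    """
    found = []
    stack = [(build_suffix_trie(seq), [])]
    while stack:
        node, prefix = stack.pop()
        for sym, child in node["children"].items():
            pattern = prefix + [sym]
            starts = child["starts"]  # already in increasing order
            if len(starts) < min_repeats:
                continue  # longer extensions cannot occur more often
            gaps_ok = all(b - a == len(pattern) for a, b in zip(starts, starts[1:]))
            if len(pattern) >= min_len and gaps_ok:
                found.append((tuple(pattern), starts))
            stack.append((child, pattern))
    return found


def best_pattern(seq, min_repeats=2, min_len=2):
    """Pick the candidate covering the most tags, preferring the earliest start."""
    candidates = repeated_patterns(seq, min_repeats, min_len)
    if not candidates:
        return None
    return max(candidates, key=lambda c: (len(c[0]) * len(c[1]), -c[1][0]))


page = ("<ul><li><a>x</a><b>1</b></li><li><a>y</a><b>2</b></li>"
        "<li><a>z</a><b>3</b></li></ul>")
sequence, path_ids = tag_path_ids(page)
print(sequence, best_pattern(sequence))
```

On this toy page the three `li` items share the tag paths `ul/li`, `ul/li/a` and `ul/li/b`, so the repeated identifier pattern marks the candidate data records. A real page would need filtering beyond the adjacency check used here.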

DC Field	Value	Language
dc.contributor.author	Fang, Y	-
dc.contributor.author	Xie, X	-
dc.contributor.author	Zhang, X	-
dc.contributor.author	Cheng, CK	-
dc.contributor.author	Zhang, Z	-
dc.date.accessioned	2017-08-25T02:55:57Z	-
dc.date.available	2017-08-25T02:55:57Z	-
dc.date.issued	2018	-
dc.identifier.citation	Knowledge and Information Systems, 2018, v. 55 n. 2, p. 305-331	-
dc.identifier.issn	0219-1377	-
dc.identifier.uri	http://hdl.handle.net/10722/243522	-
dc.description.abstract	To extract data records from Web pages automatically, an extraction algorithm must be both robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence, and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The experimental results demonstrate that STEM is superior to state-of-the-art algorithms such as MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in the Web pages, which indicates the potential applicability of STEM to a wide range of Web-scale data record extraction applications.	-
dc.language	eng	-
dc.publisher	Springer-Verlag London Ltd. The Journal's web site is located at http://link.springer.de/link/service/journals/10115/	-
dc.relation.ispartof	Knowledge and Information Systems	-
dc.rights	The final publication is available at Springer via http://dx.doi.org/[insert DOI]	-
dc.subject	Web data extraction	-
dc.subject	Suffix tree	-
dc.subject	HTML tag path	-
dc.subject	Data Record pattern	-
dc.title	STEM: a suffix tree-based method for web data records extraction	-
dc.type	Article	-
dc.identifier.email	Cheng, CK: ckcheng@cs.hku.hk	-
dc.identifier.authority	Cheng, CK=rp00074	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1007/s10115-017-1062-0	-
dc.identifier.scopus	eid_2-s2.0-85019121195	-
dc.identifier.hkuros	275445	-
dc.identifier.volume	55	-
dc.identifier.issue	2	-
dc.identifier.spage	305	-
dc.identifier.epage	331	-
dc.identifier.isi	WOS:000427971400002	-
dc.publisher.place	United Kingdom	-
dc.identifier.issnl	0219-3116	-
