Article: STEM: a suffix tree-based method for web data records extraction

Title: STEM: a suffix tree-based method for web data records extraction
Authors: Fang, Y; Xie, X; Zhang, X; Cheng, CK; Zhang, Z
Keywords: Web data extraction; Suffix tree; HTML tag path; Data Record pattern
Issue Date: 2018
Publisher: Springer-Verlag London Ltd. The Journal's web site is located at http://link.springer.de/link/service/journals/10115/
Citation: Knowledge and Information Systems, 2018, v. 55 n. 2, p. 305-331
Abstract: To automatically extract data records from Web pages, the data record extraction algorithm is required to be robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The experimental results demonstrate that the proposed STEM is superior to state-of-the-art algorithms such as MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM in a wide range of Web-scale data record extraction applications.
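The first two steps named in the abstract (turning tag paths into a sequence of identifiers, then looking for repeated subsequences in a suffix structure) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how identifiers are assigned or what the four refining filters do, so the filters are omitted entirely, and a naive suffix trie with quadratic construction cost stands in for the linear-time suffix tree the paper relies on. All names here (`TagPathCollector`, `tag_path_ids`, `build_suffix_trie`, `repeated_patterns`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.paths = []   # one path string per start tag

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_ids(html):
    """Map each distinct tag path to an integer (assumed scheme: first-seen order)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    ids = {}
    return [ids.setdefault(path, len(ids)) for path in collector.paths]


def build_suffix_trie(seq):
    """Naive suffix trie: O(n^2) time and space, for illustration only."""
    root = {"children": {}, "starts": []}
    for start in range(len(seq)):
        node = root
        for symbol in seq[start:]:
            node = node["children"].setdefault(
                symbol, {"children": {}, "starts": []})
            node["starts"].append(start)  # this substring occurs at `start`
    return root


def repeated_patterns(seq, min_len=2, min_count=2):
    """Return {pattern: start positions} for subsequences seen at least min_count times."""
    found = {}
    stack = [((), build_suffix_trie(seq))]
    while stack:
        prefix, node = stack.pop()
        for symbol, child in node["children"].items():
            # A longer pattern can never occur more often than its prefix,
            # so rare branches are pruned here.
            if len(child["starts"]) >= min_count:
                pattern = prefix + (symbol,)
                if len(pattern) >= min_len:
                    found[pattern] = child["starts"]
                stack.append((pattern, child))
    return found
```

On a page fragment with three sibling `<li>` records that each contain an `<a>` and a `<span>`, the identifier sequence repeats the same three-symbol block three times, and `repeated_patterns(seq, min_len=3, min_count=3)` reports that block as the only candidate. Deciding whether such a candidate is a real data region is what the paper's refining filters are for, and that step is not attempted here.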
Persistent Identifier: http://hdl.handle.net/10722/243522
ISSN: 0219-1377
2021 Impact Factor: 2.531
2020 SCImago Journal Rankings: 0.634
ISI Accession Number ID: WOS:000427971400002

| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Fang, Y | - |
| dc.contributor.author | Xie, X | - |
| dc.contributor.author | Zhang, X | - |
| dc.contributor.author | Cheng, CK | - |
| dc.contributor.author | Zhang, Z | - |
| dc.date.accessioned | 2017-08-25T02:55:57Z | - |
| dc.date.available | 2017-08-25T02:55:57Z | - |
| dc.date.issued | 2018 | - |
| dc.identifier.citation | Knowledge and Information Systems, 2018, v. 55 n. 2, p. 305-331 | - |
| dc.identifier.issn | 0219-1377 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/243522 | - |
| dc.description.abstract | To automatically extract data records from Web pages, the data record extraction algorithm is required to be robust and efficient. However, most existing algorithms are not robust enough to cope with rich information or noisy data. In this paper, we propose a novel suffix tree-based extraction method (STEM) for this challenging task. First, we extract a sequence of identifiers from the tag paths of Web pages. Then, a suffix tree is built on top of this sequence and four refining filters are proposed to screen out data regions that might not contain data records. To evaluate model performance, we define an evaluation metric called pattern similarity and perform rigorous experiments on five real data sets. The experimental results demonstrate that the proposed STEM is superior to state-of-the-art algorithms such as MDR, TPC and CTVS with respect to precision, recall and pattern similarity. Moreover, the time complexity of STEM is linear in the total number of HTML tags contained in Web pages, which indicates the potential applicability of STEM in a wide range of Web-scale data record extraction applications. | - |
| dc.language | eng | - |
| dc.publisher | Springer-Verlag London Ltd. The Journal's web site is located at http://link.springer.de/link/service/journals/10115/ | - |
| dc.relation.ispartof | Knowledge and Information Systems | - |
| dc.rights | The final publication is available at Springer via http://dx.doi.org/[insert DOI] | - |
| dc.subject | Web data extraction | - |
| dc.subject | Suffix tree | - |
| dc.subject | HTML tag path | - |
| dc.subject | Data Record pattern | - |
| dc.title | STEM: a suffix tree-based method for web data records extraction | - |
| dc.type | Article | - |
| dc.identifier.email | Cheng, CK: ckcheng@cs.hku.hk | - |
| dc.identifier.authority | Cheng, CK=rp00074 | - |
| dc.description.nature | link_to_subscribed_fulltext | - |
| dc.identifier.doi | 10.1007/s10115-017-1062-0 | - |
| dc.identifier.scopus | eid_2-s2.0-85019121195 | - |
| dc.identifier.hkuros | 275445 | - |
| dc.identifier.volume | 55 | - |
| dc.identifier.issue | 2 | - |
| dc.identifier.spage | 305 | - |
| dc.identifier.epage | 331 | - |
| dc.identifier.isi | WOS:000427971400002 | - |
| dc.publisher.place | United Kingdom | - |
| dc.identifier.issnl | 0219-3116 | - |
