Article: A machine learning approach to web page filtering using content and structure analysis

Title: A machine learning approach to web page filtering using content and structure analysis
Authors: Chau, M; Chen, H
Keywords: Link analysis; Machine learning; Web mining; Web page classification
Issue Date: 2008
Publisher: Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/dss
Citation: Decision Support Systems, 2008, v. 44 n. 2, p. 482-494
Abstract: As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods - a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approaches can be applied in topic-specific search engine development and other Web applications such as Web content management. © 2007 Elsevier B.V. All rights reserved.
Persistent Identifier: http://hdl.handle.net/10722/85786
ISSN: 0167-9236
2023 Impact Factor: 6.7
2023 SCImago Journal Rankings: 2.211
ISI Accession Number ID: WOS:000252651400008

 

DC Field | Value | Language
dc.contributor.author | Chau, M | en_HK
dc.contributor.author | Chen, H | en_HK
dc.date.accessioned | 2010-09-06T09:09:14Z | -
dc.date.available | 2010-09-06T09:09:14Z | -
dc.date.issued | 2008 | en_HK
dc.identifier.citation | Decision Support Systems, 2008, v. 44 n. 2, p. 482-494 | en_HK
dc.identifier.issn | 0167-9236 | en_HK
dc.identifier.uri | http://hdl.handle.net/10722/85786 | -
dc.description.abstract | As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods - a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approaches can be applied in topic-specific search engine development and other Web applications such as Web content management. © 2007 Elsevier B.V. All rights reserved. | en_HK
dc.language | eng | en_HK
dc.publisher | Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/dss | en_HK
dc.relation.ispartof | Decision Support Systems | en_HK
dc.rights | Decision Support Systems. Copyright © Elsevier BV. | en_HK
dc.subject | Link analysis | en_HK
dc.subject | Machine learning | en_HK
dc.subject | Web mining | en_HK
dc.subject | Web page classification | en_HK
dc.title | A machine learning approach to web page filtering using content and structure analysis | en_HK
dc.type | Article | en_HK
dc.identifier.openurl | http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=0167-9236&volume=44&issue=2&spage=482&epage=494&date=2008&atitle=A+Machine+Learning+Approach+to+Web+Page+Filtering+Using+Content+and+Structure+Analysis | en_HK
dc.identifier.email | Chau, M: mchau@hkucc.hku.hk | en_HK
dc.identifier.authority | Chau, M=rp01051 | en_HK
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1016/j.dss.2007.06.002 | en_HK
dc.identifier.scopus | eid_2-s2.0-36249013642 | en_HK
dc.identifier.hkuros | 148563 | en_HK
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-36249013642&selection=ref&src=s&origin=recordpage | en_HK
dc.identifier.volume | 44 | en_HK
dc.identifier.issue | 2 | en_HK
dc.identifier.spage | 482 | en_HK
dc.identifier.epage | 494 | en_HK
dc.identifier.isi | WOS:000252651400008 | -
dc.publisher.place | Netherlands | en_HK
dc.identifier.scopusauthorid | Chau, M=7006073763 | en_HK
dc.identifier.scopusauthorid | Chen, H=8871373800 | en_HK
dc.identifier.issnl | 0167-9236 | -
