Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1016/j.dss.2007.06.002
- Scopus: eid_2-s2.0-36249013642
- WOS: WOS:000252651400008
Article: A machine learning approach to web page filtering using content and structure analysis
Title | A machine learning approach to web page filtering using content and structure analysis |
---|---|
Authors | Chau, M; Chen, H |
Keywords | Link analysis; Machine learning; Web mining; Web page classification |
Issue Date | 2008 |
Publisher | Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/dss |
Citation | Decision Support Systems, 2008, v. 44 n. 2, p. 482-494 |
Abstract | As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods - a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approaches can be applied in topic-specific search engine development and other Web applications such as Web content management. © 2007 Elsevier B.V. All rights reserved. |
Persistent Identifier | http://hdl.handle.net/10722/85786 |
ISSN | 0167-9236 (2023 Impact Factor: 6.7; 2023 SCImago Journal Rankings: 2.211) |
ISI Accession Number ID | WOS:000252651400008 |
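References | http://www.scopus.com/mlt/select.url?eid=2-s2.0-36249013642&selection=ref&src=s&origin=recordpage |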
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chau, M | en_HK |
dc.contributor.author | Chen, H | en_HK |
dc.date.accessioned | 2010-09-06T09:09:14Z | - |
dc.date.available | 2010-09-06T09:09:14Z | - |
dc.date.issued | 2008 | en_HK |
dc.identifier.citation | Decision Support Systems, 2008, v. 44 n. 2, p. 482-494 | en_HK |
dc.identifier.issn | 0167-9236 | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/85786 | - |
dc.language | eng | en_HK |
dc.publisher | Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/dss | en_HK |
dc.relation.ispartof | Decision Support Systems | en_HK |
dc.rights | Decision Support Systems. Copyright © Elsevier BV. | en_HK |
dc.subject | Link analysis | en_HK |
dc.subject | Machine learning | en_HK |
dc.subject | Web mining | en_HK |
dc.subject | Web page classification | en_HK |
dc.title | A machine learning approach to web page filtering using content and structure analysis | en_HK |
dc.type | Article | en_HK |
dc.identifier.openurl | http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=0167-9236&volume=44&issue=2&spage=482&epage=494&date=2008&atitle=A+Machine+Learning+Approach+to+Web+Page+Filtering+Using+Content+and+Structure+Analysis | en_HK |
dc.identifier.email | Chau, M: mchau@hkucc.hku.hk | en_HK |
dc.identifier.authority | Chau, M=rp01051 | en_HK |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1016/j.dss.2007.06.002 | en_HK |
dc.identifier.scopus | eid_2-s2.0-36249013642 | en_HK |
dc.identifier.hkuros | 148563 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-36249013642&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.volume | 44 | en_HK |
dc.identifier.issue | 2 | en_HK |
dc.identifier.spage | 482 | en_HK |
dc.identifier.epage | 494 | en_HK |
dc.identifier.isi | WOS:000252651400008 | - |
dc.publisher.place | Netherlands | en_HK |
dc.identifier.scopusauthorid | Chau, M=7006073763 | en_HK |
dc.identifier.scopusauthorid | Chen, H=8871373800 | en_HK |
dc.identifier.issnl | 0167-9236 | - |