A machine learning approach to web page filtering using content and structure analysis

Chau, M; Chen, H

Publisher Website: 10.1016/j.dss.2007.06.002
Scopus: eid_2-s2.0-36249013642
WOS: WOS:000252651400008
Appears in Collections:
- Faculty of Business & Economics: Journal/Magazine Articles

Title	A machine learning approach to web page filtering using content and structure analysis
Authors	Chau, M; Chen, H
Keywords	Link analysis; Machine learning; Web mining; Web page classification
Issue Date	2008
Publisher	Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/dss
Citation	Decision Support Systems, 2008, v. 44 n. 2, p. 482-494. DOI: http://dx.doi.org/10.1016/j.dss.2007.06.002
Abstract	As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods - a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approaches can be applied in topic-specific search engine development and other Web applications such as Web content management. © 2007 Elsevier B.V. All rights reserved.
Persistent Identifier	http://hdl.handle.net/10722/85786
ISSN	0167-9236 (2023 Impact Factor: 6.7; 2023 SCImago Journal Rankings: 2.211)
ISI Accession Number ID	WOS:000252651400008
References	References in Scopus
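
Illustration (not from the paper): the abstract describes representing each Web page by content-based and link-based features that are then passed to a learning algorithm. The sketch below shows one way such a representation could look. The Page fields, the LEXICON terms, and the four features are all hypothetical, since this record does not list the paper's actual feature set. A simple perceptron stands in for the feedforward/backpropagation neural network and the support vector machine named in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Page:
    """Hypothetical page record; field names are assumptions for illustration."""
    title: str
    body: str
    anchor_texts: List[str] = field(default_factory=list)   # text of links pointing at the page
    neighbor_labels: List[int] = field(default_factory=list)  # 1 if a linked page is known relevant


# Hypothetical domain lexicon; a real system would use a curated term list.
LEXICON: Set[str] = {"cancer", "therapy", "clinical"}


def count_terms(text: str, lexicon: Set[str]) -> int:
    """Count whitespace-separated tokens that appear in the lexicon."""
    return sum(1 for tok in text.lower().split() if tok.strip(".,;:!?") in lexicon)


def features(page: Page, lexicon: Set[str] = LEXICON) -> List[float]:
    """Concatenate content-based features with link-based features."""
    content = [count_terms(page.title, lexicon), count_terms(page.body, lexicon)]
    link = [
        sum(count_terms(a, lexicon) for a in page.anchor_texts),
        sum(page.neighbor_labels),
    ]
    return [float(v) for v in content + link]


def predict(w: List[float], b: float, x: List[float]) -> int:
    """Return 1 (relevant) if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(X: List[List[float]], y: List[int],
                     epochs: int = 20, lr: float = 0.1) -> Tuple[List[float], float]:
    """Plain perceptron updates; a stand-in for the classifiers in the abstract."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            err = target - predict(w, b, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


relevant = Page("Cancer therapy news", "A clinical trial of cancer therapy.",
                ["cancer center"], [1, 0, 1])
irrelevant = Page("Football scores", "Weekend results.", ["sports"], [0])

X = [features(relevant), features(irrelevant)]
w, b = train_perceptron(X, [1, 0])
```

Any classifier that accepts fixed-length numeric vectors could replace train_perceptron here; the point of the sketch is only that content and link evidence end up in one feature vector per page.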

DC Field	Value	Language
dc.contributor.author	Chau, M	en_HK
dc.contributor.author	Chen, H	en_HK
dc.date.accessioned	2010-09-06T09:09:14Z	-
dc.date.available	2010-09-06T09:09:14Z	-
dc.date.issued	2008	en_HK
dc.identifier.citation	Decision Support Systems, 2008, v. 44 n. 2, p. 482-494	en_HK
dc.identifier.issn	0167-9236	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/85786	-
dc.description.abstract	As the Web continues to grow, it has become increasingly difficult to search for relevant information using traditional search engines. Topic-specific search engines provide an alternative way to support efficient information retrieval on the Web by providing more precise and customized searching in various domains. However, developers of topic-specific search engines need to address two issues: how to locate relevant documents (URLs) on the Web and how to filter out irrelevant documents from a set of documents collected from the Web. This paper reports our research in addressing the second issue. We propose a machine-learning-based approach that combines Web content analysis and Web structure analysis. We represent each Web page by a set of content-based and link-based features, which can be used as the input for various machine learning algorithms. The proposed approach was implemented using both a feedforward/backpropagation neural network and a support vector machine. Two experiments were designed and conducted to compare the proposed Web-feature approach with two existing Web page filtering methods - a keyword-based approach and a lexicon-based approach. The experimental results showed that the proposed approach in general performed better than the benchmark approaches, especially when the number of training documents was small. The proposed approaches can be applied in topic-specific search engine development and other Web applications such as Web content management. © 2007 Elsevier B.V. All rights reserved.	en_HK
dc.language	eng	en_HK
dc.publisher	Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/dss	en_HK
dc.relation.ispartof	Decision Support Systems	en_HK
dc.rights	Decision Support Systems. Copyright © Elsevier BV.	en_HK
dc.subject	Link analysis	en_HK
dc.subject	Machine learning	en_HK
dc.subject	Web mining	en_HK
dc.subject	Web page classification	en_HK
dc.title	A machine learning approach to web page filtering using content and structure analysis	en_HK
dc.type	Article	en_HK
dc.identifier.openurl	http://library.hku.hk:4550/resserv?sid=HKU:IR&issn=0167-9236&volume=44&issue=2&spage=482&epage=494&date=2008&atitle=A+Machine+Learning+Approach+to+Web+Page+Filtering+Using+Content+and+Structure+Analysis	en_HK
dc.identifier.email	Chau, M: mchau@hkucc.hku.hk	en_HK
dc.identifier.authority	Chau, M=rp01051	en_HK
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1016/j.dss.2007.06.002	en_HK
dc.identifier.scopus	eid_2-s2.0-36249013642	en_HK
dc.identifier.hkuros	148563	en_HK
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-36249013642&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.volume	44	en_HK
dc.identifier.issue	2	en_HK
dc.identifier.spage	482	en_HK
dc.identifier.epage	494	en_HK
dc.identifier.isi	WOS:000252651400008	-
dc.publisher.place	Netherlands	en_HK
dc.identifier.scopusauthorid	Chau, M=7006073763	en_HK
dc.identifier.scopusauthorid	Chen, H=8871373800	en_HK
dc.identifier.issnl	0167-9236	-
