Conference Paper: A comparative study of centroid-based, neighborhood-based and statistical approaches for effective document categorization
Title | A comparative study of centroid-based, neighborhood-based and statistical approaches for effective document categorization |
---|---|
Authors | Tam, V; Santoso, A; Setiono, R |
Issue Date | 2002 |
Citation | Proceedings - International Conference On Pattern Recognition, 2002, v. 16 n. 4, p. 235-238 |
Abstract | Associating documents to relevant categories is critical for effective document retrieval. Here, we compare the well-known k-Nearest Neighborhood (kNN) algorithm, the centroid-based classifier and the Highest Average Similarity over Retrieved Documents (HASRD) algorithm, for effective document categorization. We use various measures such as the micro and macro F1 values to evaluate their performance on the Reuters-21578 corpus. The empirical results show that kNN performs the best, followed by our adapted HASRD and the centroid-based classifier for common document categories, while the centroid-based classifier and kNN outperform our adapted HASRD for rare document categories. Additionally, our study clearly indicates that each classifier performs optimally only when a suitable term weighting scheme is used. All these significant results lead to many exciting directions for future exploration. © 2002 IEEE. |
Persistent Identifier | http://hdl.handle.net/10722/158425 |
ISSN | 1051-4651 (2023 SCImago Journal Rankings: 0.584) |
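
The abstract reports both micro and macro F1 and finds a different ranking of classifiers for common and rare categories. The following is a minimal sketch of how the two averages differ, not the authors' evaluation code. The helper names `f1` and `micro_macro_f1` and the toy labels are assumptions made for illustration; no figures from the paper are reproduced. Micro F1 pools decisions over all categories, so frequent categories dominate it, while macro F1 averages per-category scores, so rare categories count equally.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold and pred are parallel lists of label sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for c in g & p:   # category assigned and correct
            tp[c] += 1
        for c in p - g:   # category assigned but wrong
            fp[c] += 1
        for c in g - p:   # category missed
            fn[c] += 1
    cats = sorted(set(tp) | set(fp) | set(fn))
    if not cats:
        return 0.0, 0.0
    # Macro: average the per-category F1 values.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in cats) / len(cats)
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy data (hypothetical): one common category and one rare category.
gold = [{"earn"}] * 8 + [{"rye"}] * 2
pred = [{"earn"}] * 8 + [{"rye"}, {"earn"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case a single error on the rare category lowers macro F1 much more than micro F1, which is why reporting both is informative when category sizes are skewed.
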
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Tam, V | en_US |
dc.contributor.author | Santoso, A | en_US |
dc.contributor.author | Setiono, R | en_US |
dc.date.accessioned | 2012-08-08T08:59:33Z | - |
dc.date.available | 2012-08-08T08:59:33Z | - |
dc.date.issued | 2002 | en_US |
dc.identifier.citation | Proceedings - International Conference On Pattern Recognition, 2002, v. 16 n. 4, p. 235-238 | en_US |
dc.identifier.issn | 1051-4651 | en_US |
dc.identifier.uri | http://hdl.handle.net/10722/158425 | - |
dc.description.abstract | Associating documents to relevant categories is critical for effective document retrieval. Here, we compare the well-known k-Nearest Neighborhood (kNN) algorithm, the centroid-based classifier and the Highest Average Similarity over Retrieved Documents (HASRD) algorithm, for effective document categorization. We use various measures such as the micro and macro F1 values to evaluate their performance on the Reuters-21578 corpus. The empirical results show that kNN performs the best, followed by our adapted HASRD and the centroid-based classifier for common document categories, while the centroid-based classifier and kNN outperform our adapted HASRD for rare document categories. Additionally, our study clearly indicates that each classifier performs optimally only when a suitable term weighting scheme is used. All these significant results lead to many exciting directions for future exploration. © 2002 IEEE. | en_US |
dc.language | eng | en_US |
dc.relation.ispartof | Proceedings - International Conference on Pattern Recognition | en_US |
dc.title | A comparative study of centroid-based, neighborhood-based and statistical approaches for effective document categorization | en_US |
dc.type | Conference_Paper | en_US |
dc.identifier.email | Tam, V:vtam@eee.hku.hk | en_US |
dc.identifier.authority | Tam, V=rp00173 | en_US |
dc.description.nature | link_to_subscribed_fulltext | en_US |
dc.identifier.scopus | eid_2-s2.0-29144522357 | en_US |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-29144522357&selection=ref&src=s&origin=recordpage | en_US |
dc.identifier.volume | 16 | en_US |
dc.identifier.issue | 4 | en_US |
dc.identifier.spage | 235 | en_US |
dc.identifier.epage | 238 | en_US |
dc.publisher.place | United States | en_US |
dc.identifier.scopusauthorid | Tam, V=7005091988 | en_US |
dc.identifier.scopusauthorid | Santoso, A=6601931777 | en_US |
dc.identifier.scopusauthorid | Setiono, R=7005033162 | en_US |
dc.identifier.issnl | 1051-4651 | - |
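
For readers unfamiliar with two of the classifiers named in the abstract, the following is a textbook-style sketch of a centroid-based classifier and a similarity-weighted kNN classifier over bag-of-words vectors. It is an illustration under stated assumptions, not the configuration used in the paper: the function names, the raw term-frequency weighting, the cosine similarity, the value of `k` and the toy documents are all hypothetical. HASRD is not sketched because the record does not describe it in enough detail.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words vector with raw term frequencies (an assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid_classify(train, doc):
    """Assign the category whose mean training vector is most similar to doc."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, label in train:
        sums[label].update(vec)
        counts[label] += 1
    centroids = {
        c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()
    }
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))


def knn_classify(train, doc, k=3):
    """Let the k most similar training documents vote, weighted by similarity."""
    nearest = sorted(train, key=lambda item: cosine(doc, item[0]), reverse=True)[:k]
    votes = Counter()
    for vec, label in nearest:
        votes[label] += cosine(doc, vec)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical documents and labels).
train = [
    (vectorize("profit rose quarter earnings"), "earn"),
    (vectorize("quarter dividend profit net"), "earn"),
    (vectorize("wheat harvest grain export"), "grain"),
    (vectorize("grain shipment wheat tonnes"), "grain"),
]
query = vectorize("wheat export tonnes")
print(centroid_classify(train, query), knn_classify(train, query, k=3))
```

The abstract notes that each classifier performs best only with a suitable term weighting scheme; in this sketch that would correspond to replacing `vectorize` with a different weighting while leaving the classifiers unchanged.
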