Conference Paper: Decomposing background topics from keywords by Principal Component Pursuit

Title: Decomposing background topics from keywords by Principal Component Pursuit
Authors: Min, Kerui; Zhang, Zhengdong; Wright, John; Ma, Yi
Keywords: Latent Dirichlet Allocation; Latent Semantic Indexing; Perplexity; Principal Component Pursuit; Sparse keywords
Issue Date: 2010
Citation: International Conference on Information and Knowledge Management, Proceedings, 2010, p. 269-277
Abstract: Low-dimensional topic models have proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis (PCA) or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model for a document corpus is the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate for analyzing the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose the low-dimensional and the sparse components of such corpus data. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) model to validate the new model. We then show that for real document data analysis, the new tool significantly reduces the perplexity and improves retrieval performance compared to classical baselines. © 2010 ACM.
Persistent Identifier: http://hdl.handle.net/10722/326849
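
The low-rank-plus-sparse decomposition described in the abstract can be reproduced with a generic Principal Component Pursuit solver. The following is a minimal sketch of one standard approach, an inexact augmented-Lagrangian iteration with singular value thresholding, and is not necessarily the exact solver used in the paper. The function names, the default lambda = 1/sqrt(max(m, n)), and the synthetic term-document matrix in the demo are assumptions drawn from the general PCP literature, included only for illustration.

import numpy as np


def shrink(X, tau):
    """Entrywise soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_shrink(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt


def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Principal Component Pursuit (sketch, not the authors' exact solver):
        minimize  ||L||_* + lam * ||S||_1   subject to   L + S = M
    solved with a basic inexact augmented-Lagrangian iteration.
    Returns the low-rank part L (shared background topics) and the sparse part S (keywords).
    """
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))   # common default from the PCP literature (assumption)
    if mu is None:
        mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    Y = np.zeros_like(M)                 # dual variable
    S = np.zeros_like(M)
    norm_M = np.linalg.norm(M, "fro")
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)       # sparse update
        residual = M - L - S
        Y = Y + mu * residual                      # dual ascent step
        if np.linalg.norm(residual, "fro") / norm_M < tol:
            break
    return L, S


if __name__ == "__main__":
    # Illustrative synthetic "term-document" matrix: a rank-3 background plus a few
    # large document-specific entries standing in for salient keywords.
    rng = np.random.default_rng(0)
    background = rng.random((500, 3)) @ rng.random((3, 100))
    keywords = (rng.random((500, 100)) < 0.01) * 5.0
    L, S = pcp(background + keywords)
    print("recovered rank of L:", np.linalg.matrix_rank(L, tol=1e-3))
    print("nonzeros in S:", int((np.abs(S) > 1e-3).sum()))

Applied to a term-document matrix M, the recovered L captures the topics shared across the corpus while S isolates document-specific keywords, which is the decomposition the abstract describes.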

 

DC Field / Value
dc.contributor.author: Min, Kerui
dc.contributor.author: Zhang, Zhengdong
dc.contributor.author: Wright, John
dc.contributor.author: Ma, Yi
dc.date.accessioned: 2023-03-31T05:26:58Z
dc.date.available: 2023-03-31T05:26:58Z
dc.date.issued: 2010
dc.identifier.citation: International Conference on Information and Knowledge Management, Proceedings, 2010, p. 269-277
dc.identifier.uri: http://hdl.handle.net/10722/326849
dc.description.abstract: Low-dimensional topic models have proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis (PCA) or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model for a document corpus is the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate for analyzing the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose the low-dimensional and the sparse components of such corpus data. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) model to validate the new model. We then show that for real document data analysis, the new tool significantly reduces the perplexity and improves retrieval performance compared to classical baselines. © 2010 ACM.
dc.language: eng
dc.relation.ispartof: International Conference on Information and Knowledge Management, Proceedings
dc.subject: Latent Dirichlet Allocation
dc.subject: Latent Semantic Indexing
dc.subject: Perplexity
dc.subject: Principal Component Pursuit
dc.subject: Sparse keywords
dc.title: Decomposing background topics from keywords by Principal Component Pursuit
dc.type: Conference_Paper
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1145/1871437.1871475
dc.identifier.scopus: eid_2-s2.0-78651312555
dc.identifier.spage: 269
dc.identifier.epage: 277
