Conference Paper: OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Title: OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
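Authors: Zhang, Junyuan; Zhang, Qintong; Wang, Bin; Ouyang, Linke; Wen, Zichen; Li, Ying; Chow, Ka-Ho; He, Conghui; Zhang, Wentao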
Issue Date: 19-Oct-2025
Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect predictions of OCR and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various kinds of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 8,561 carefully selected unstructured document images from seven real-world RAG application domains, along with 8,498 Q&A pairs derived from multimodal elements in the documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbations to generate a set of structured data with varying degrees of each noise type. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and show how RAG performance trends with the degree of OCR noise. OHRBench, including the PDF documents, Q&A pairs, and ground-truth structured data, will be released to foster the development of OCR tailored to RAG and of RAG systems that are resilient to OCR noise.


Persistent Identifier: http://hdl.handle.net/10722/359007

 

DC Field | Value | Language
dc.contributor.author | Zhang, Junyuan | -
dc.contributor.author | Zhang, Qintong | -
dc.contributor.author | Wang, Bin | -
dc.contributor.author | Ouyang, Linke | -
dc.contributor.author | Wen, Zichen | -
dc.contributor.author | Li, Ying | -
dc.contributor.author | Chow, Ka-Ho | -
dc.contributor.author | He, Conghui | -
dc.contributor.author | Zhang, Wentao | -
dc.date.accessioned | 2025-08-19T00:32:03Z | -
dc.date.available | 2025-08-19T00:32:03Z | -
dc.date.issued | 2025-10-19 | -
dc.identifier.uri | http://hdl.handle.net/10722/359007 | -
dc.language | eng | -
dc.relation.ispartof | International Conference on Computer Vision (ICCV) (19/10/2025-23/10/2025, Honolulu, Hawai'i) | -
dc.title | OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation | -
dc.type | Conference_Paper | -
dc.identifier.doi | 10.48550/arXiv.2412.02592 | -
