Automated Data Extraction System Using Optical Character Recognition and Natural Language Processing for Scalable Big Data Processing

Authors

  • Dr Ningthoujam Chidananda Singh, Department of Computer Science, Yenepoya (Deemed-to-be University), Bangalore, Karnataka, India.
  • P. Navya Sree, M.Sc. Computer Science (Data Science with Minor in Big Data Analytics), Yenepoya (Deemed-to-be University), Bangalore, Karnataka, India.

DOI:

https://doi.org/10.47392/IRJAEM.2026.0205

Keywords:

OCR, rule-based NLP, scalability, accuracy, efficiency

Abstract

Scanned documents such as identity cards, application forms, certificates, and reports hold large volumes of the information used in real-world applications. These files are usually stored as images or PDFs, and their unstructured text is difficult for automated systems to process. Manual extraction is time-consuming, error-prone, and unable to handle large datasets. This paper describes an automated Data Extraction System that converts unstructured document content into structured, usable information using Optical Character Recognition (OCR), rule-based Natural Language Processing (NLP), and Hadoop for scale-out big data processing. The proposed system has a modular architecture consisting of a web-based frontend (HTML, CSS, JavaScript), a FastAPI backend, document processing modules built on the Hadoop Distributed File System (HDFS) and MapReduce, and a structured database (PL/SQL). Through a simple interface, it supports user authentication, bulk document upload, real-time visualization of extracted data, and export in JSON and Excel formats. Experimental analysis on large-scale data with Hadoop parallelism shows an 85 percent reduction in manual workload, 92 percent extraction accuracy, and 4 times faster processing. These findings confirm that the system is applicable to enterprise settings and point to future improvements such as machine-learning-based extraction, multilingual support, and hybrid cloud-Hadoop deployments.
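
The rule-based extraction and map/reduce stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the OCR step has already produced plain text, the field names and regular expressions in RULES are hypothetical, and ordinary Python functions stand in for the Hadoop MapReduce jobs.

```python
import json
import re
from collections import Counter

# Hypothetical extraction rules: field name -> regex with one capture group.
# Real documents would need rules tuned to their actual layouts.
RULES = {
    "name": re.compile(r"Name\s*[:\-]\s*([A-Za-z .]+)"),
    "dob": re.compile(r"(?:DOB|Date of Birth)\s*[:\-]\s*(\d{2}[/-]\d{2}[/-]\d{4})"),
    "id_number": re.compile(r"ID\s*(?:No\.?|Number)\s*[:\-]\s*([A-Z0-9]+)"),
}


def map_document(doc_id, ocr_text):
    """Map step: turn one document's OCR text into a structured record."""
    record = {"doc_id": doc_id}
    for field, pattern in RULES.items():
        match = pattern.search(ocr_text)
        # Fields with no matching rule are kept as None, not dropped.
        record[field] = match.group(1).strip() if match else None
    return record


def reduce_coverage(records):
    """Reduce step: count how many records contain each extracted field."""
    counts = Counter()
    for record in records:
        for field in RULES:
            if record[field] is not None:
                counts[field] += 1
    return dict(counts)


if __name__ == "__main__":
    # Assumed OCR output for two scanned forms (illustrative data only).
    documents = {
        "doc-1": "Name: Asha Rao\nDate of Birth: 01-02-1990\nID No: KA12345\n",
        "doc-2": "Name - Ravi Kumar\nID Number: TN99881\n",
    }
    records = [map_document(doc_id, text) for doc_id, text in documents.items()]
    print(json.dumps(records, indent=2))  # JSON export of the structured records
    print(reduce_coverage(records))
```

Because each document is mapped independently, the map step is the part that a framework such as Hadoop could distribute across nodes, while the reduce step aggregates the per-document results.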

Published

2026-05-07