Automated Data Extraction System Using Optical Character Recognition and Natural Language Processing for Scalable Big Data Processing
DOI:
https://doi.org/10.47392/IRJAEM.2026.0205
Keywords:
Integrating OCR, rule-based NLP, achieving scalability, accuracy, efficiency unattainable
Abstract
Scanned documents such as identity cards, application forms, certificates, and reports hold large volumes of information used in real-world applications. These files are typically stored as images or PDFs whose text is unstructured, which makes automated processing difficult. Manual extraction is time-consuming, error-prone, and does not scale to large datasets. This paper describes an automated Data Extraction System that converts unstructured document content into structured, usable information using Optical Character Recognition (OCR), rule-based Natural Language Processing (NLP), and Hadoop for scale-out big data processing. The proposed system has a modular architecture consisting of a web-based frontend (HTML, CSS, JavaScript), a FastAPI backend, document processing modules built on the Hadoop Distributed File System (HDFS) and MapReduce, and a structured database (PL/SQL). Through a simple interface it supports user authentication, bulk document upload, real-time visualization of extracted data, and export in JSON/Excel format. Experimental analysis on large-scale data with Hadoop parallelism shows an 85 percent reduction in manual workload, 92 percent extraction accuracy, and 4 times faster processing. These findings support the system's applicability to enterprise settings and point to future improvements such as machine-learning-based extraction, multilingual support, and hybrid cloud-Hadoop deployments.
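As a rough illustration of the rule-based extraction step described in the abstract, the following minimal Python sketch maps OCR output text to a structured record and serializes it as JSON. It is not the authors' implementation: the field names, the regular-expression rules, the function `extract_fields`, and the sample text are all hypothetical, and the OCR stage itself is assumed to have already produced plain text.

```python
import json
import re

# Hypothetical rules: each field name maps to a regex with one capture group.
# A real deployment would define rules per document type.
RULES = {
    "name": re.compile(r"^Name\s*[:\-]\s*(.+)$", re.IGNORECASE | re.MULTILINE),
    "date_of_birth": re.compile(
        r"^(?:DOB|Date of Birth)\s*[:\-]\s*(\d{2}[/-]\d{2}[/-]\d{4})\s*$",
        re.IGNORECASE | re.MULTILINE,
    ),
    "id_number": re.compile(
        r"^ID\s*(?:No\.?|Number)?\s*[:\-]\s*([A-Z0-9]+)\s*$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_fields(ocr_text):
    """Apply every rule to the OCR text; fields with no match are set to None."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(ocr_text)
        record[field] = match.group(1).strip() if match else None
    return record


# Hypothetical OCR output for a scanned application form.
SAMPLE_OCR_TEXT = """APPLICATION FORM
Name: Asha Rao
Date of Birth: 14/02/1998
ID No: AB12345
"""

record = extract_fields(SAMPLE_OCR_TEXT)
print(json.dumps(record, indent=2))  # JSON export of the structured record
```

Because `extract_fields` depends only on its input text, it could in principle serve as the per-document map step of a MapReduce job, with each document processed independently; that mapping is an assumption of this sketch and is not stated in the abstract.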
License
Copyright (c) 2026 International Research Journal on Advanced Engineering and Management (IRJAEM)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.