
In the past, building an Intelligent Document Processing (IDP) pipeline meant long hours of drawing boundaries around data items, labelling them, and creating multiple templates to handle the different formats of the input documents. If the positions shifted by a few pixels, the pipeline broke. Typically, these template-based Optical Character Recognition (OCR) systems did not interpret the document content; they only scraped data based on spatial coordinates.

Then came the era of Large Language Models (LLMs), some of which have built-in vision capabilities: they accept images as input and can extract data from them. Developers could now send documents without worrying about a rigid structure and get a structured JSON response back.

But does that really mean specialised OCR models are now dead? In the world of Enterprise Architecture, absolute statements are rarely true, and the reality is more nuanced. While the major vision-capable LLMs have mostly eliminated the need for dedicated, custom-trained OCR models, those models are still relevant in high-volume edge cases. Also, if the LLM used is an open-source model, the results are highly likely to fall short of acceptable. Here is how the landscape has shifted, where OCR models still survive, and some hard truths about open-source vision LLMs.

## General Shift from Specialised OCR Models to Vision LLMs

To understand why the shift has started, we have to look at how OCR models work in general. A specialised model for document processing requires building an extensive pipeline: you have to collect samples of all possible structures, label them, and/or train a custom model. Vision-capable LLMs, on the other hand, process documents holistically. They don’t extract text based on location; they understand the semantic relationships in the image.
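The brittleness of coordinate-based templates described above can be illustrated with a minimal sketch. Everything here is hypothetical: the `Word` and `Zone` types, the `INVOICE_TEMPLATE` coordinates, and the sample scans are made up for illustration and do not correspond to any particular OCR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR token with the top-left corner of its bounding box, in pixels."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Zone:
    """A hand-drawn template region: anything inside it is treated as the field value."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


# Hypothetical template for one invoice layout (coordinates are made up).
INVOICE_TEMPLATE = {
    "invoice_number": Zone(left=400, top=40, right=560, bottom=60),
    "total": Zone(left=400, top=700, right=560, bottom=720),
}


def extract_by_template(words: list[Word], template: dict[str, Zone]) -> dict[str, str]:
    """Join the words that fall inside each zone; the text itself is never interpreted."""
    return {
        field: " ".join(w.text for w in words if zone.contains(w))
        for field, zone in template.items()
    }


original_scan = [Word("INV-1042", 410, 50), Word("1,250.00", 420, 710)]
# The same document, rescanned so that everything sits 15 pixels lower.
shifted_scan = [Word(w.text, w.x, w.y + 15) for w in original_scan]

print(extract_by_template(original_scan, INVOICE_TEMPLATE))  # both fields found
print(extract_by_template(shifted_scan, INVOICE_TEMPLATE))   # both fields come back empty
```

The extraction logic never looks at what the words say, only at where they are, which is why a small shift silently empties every field.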
Most frontier LLMs have zero-shot capabilities: you can upload an entirely new document, ask the LLM to look for the data items you need, and get the output in a predefined structured format. All it needs is good prompt engineering and structured response techniques. This means there is no longer a need to know all document formats upfront, and no need to retrain when known templates change.

Most of these frontier models understand the linguistic context as well as the structure, which makes them far better than traditional OCR models in several areas:

- Tables: reading data across complex multi-column layouts.
- Charts and Graphs: visual analytics are understood along with the text.
- Agentic Capabilities: instead of just extracting data, a Vision Language Model (VLM) can reason: “Extract the data in the document only if it is an order receipt. If not, return a NULL JSON object.”

These native capabilities can effectively reduce the size of the pipeline, which traditionally would have needed OCR parsing, classification, and custom code to achieve the same result.

This is where the “mostly” comes into the picture. If vision LLMs could solve every problem, then Azure Document Intelligence, Google Document AI, and Amazon Textract would no longer exist. It all comes down to physics and finance:

- High Scale: Processing documents at huge scale through frontier LLMs could bankrupt your company. At massive scale, such as millions of pages, it still makes more sense to build and use specialised OCR models that cost only a fraction as much per image.
- Zero Hallucination Guarantee: Vision LLMs are probabilistic in nature: they generate output by predicting the next token. Although their accuracy is high, there is still a possibility that they hallucinate a ‘1’ as a ‘7’ in a dense, low-resolution image. OCR models, on the other hand, are deterministic and return extraction results with confidence scores.
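Because a VLM's reply is probabilistic, it helps to validate it before trusting it. The sketch below is one possible way to do that, not a prescribed method: it assumes a hypothetical receipt schema (`vendor`, `line_items`, `total`), works on the raw reply string rather than calling any real model API, and uses an arithmetic cross-check as a cheap way to catch a misread digit.

```python
import json
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("vendor", "line_items", "total")  # hypothetical receipt schema

# Hypothetical prompt built around the conditional instruction quoted above.
PROMPT = (
    "Extract the data in the document only if it is an order receipt. "
    "Return JSON with the keys vendor, line_items (a list of objects with "
    "description and amount) and total. If it is not an order receipt, return null."
)


def parse_receipt(raw_reply: str):
    """Return (receipt, problems). receipt is None when the model answered null."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return None, [f"reply is not valid JSON: {exc.msg}"]

    if data is None:  # the model decided this is not an order receipt
        return None, []
    if not isinstance(data, dict):
        return None, ["reply is JSON but not an object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if problems:
        return data, problems

    try:
        items_sum = sum(
            (Decimal(str(item["amount"])) for item in data["line_items"]),
            Decimal("0"),
        )
        total = Decimal(str(data["total"]))
    except (InvalidOperation, KeyError, TypeError):
        return data, ["amounts are not parseable numbers"]

    # A misread digit (for example '1' read as '7') usually breaks this arithmetic.
    if items_sum != total:
        problems.append(f"line items sum to {items_sum}, but total is {total}")
    return data, problems


# Made-up replies standing in for what a model might return.
good = (
    '{"vendor": "Acme", "line_items": ['
    '{"description": "Widget", "amount": "10.50"}, '
    '{"description": "Bolt", "amount": "1.25"}], "total": "11.75"}'
)
misread = good.replace('"11.75"', '"17.75"')

print(parse_receipt(good)[1])
print(parse_receipt(misread)[1])
print(parse_receipt("null"))
```

A check like this does not prevent hallucination; it only flags replies that are internally inconsistent so they can be retried or sent for review.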
## The Other Problem: Open Source Vision LLMs

A reader might think that if processing huge volumes of images is costly with frontier LLM APIs, the logical next option is to use open-source LLMs. But here is the hard truth: open-source LLMs are mostly not good enough for production document processing. While they are making very good progress with the support of open-source communities, these models still struggle with standard, dense, and complex documents. Here are some of the problems:

- High-Resolution Context Problem: Proprietary models use dynamic cropping and advanced vision encoders to process images without losing fine details. Most open-source models, however, compress images into a grid of visual tokens to save on the limited VRAM available to them.
- Spatial Blindness: Even though open-source models can identify, say, whether there is a dog in an image, they are likely to struggle to locate and extract the price of an item from the fourth column of the third row.
- Compute Trade-Off: Getting an open-source LLM to approach the accuracy of a proprietary model requires a cluster of high-end GPUs, which negates the cost savings that were the point of choosing open source.

## Way Forward - Hybrid Approach

The industry does not have to choose a single option. The answer is most likely a hybrid approach: instead of relying on one monolithic system to do it all, you use the right model for the right layer:

| Tool Category | Best Use Case | The Trade-offs |
|----|----|----|
| Traditional/Open-Source OCR (e.g., Tesseract) | Deterministic coordinate mapping, low-cost operations. | Fails on complex structures; little or no reasoning ability. |
| Open-Source OCR/VLM Hybrids (e.g., Docling, olmOCR) | Converting messy PDFs to clean Markdown on local GPUs for RAG pipelines. | May require significant infrastructure and engineering to deploy at scale. |
| Frontier VLMs (e.g., Claude 3.5 Sonnet, GPT-4o) | Unstructured extraction, reasoning-heavy agentic workflows, dynamic document types. | High API costs; occasional probabilistic hallucinations. |

Finally, here are the considerations for any developer working with IDP pipelines.

Route based on complexity:

- For highly complex documents, prefer a frontier VLM.
- For well-structured forms with a predictable format, try open-source LLMs.
- If the open-source LLMs are not doing a good job and the frontier VLMs would be too costly, stick to OCR models.
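The routing rules above can be sketched as a small decision function. This is only an illustration: `DocumentProfile` and its three flags are hypothetical signals that a pipeline would have to compute from its own evaluations and budget, and the branch for "open source failed but the frontier model is affordable" is an assumption, since the rules above do not spell it out.

```python
from dataclasses import dataclass
from enum import Enum


class Engine(Enum):
    FRONTIER_VLM = "frontier_vlm"
    OPEN_SOURCE_VLM = "open_source_vlm"
    SPECIALISED_OCR = "specialised_ocr"


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical signals a pipeline might compute before extraction."""
    is_complex: bool               # e.g. free-form layout or unseen document type
    open_source_accuracy_ok: bool  # did the open-source model pass your own evaluation?
    frontier_within_budget: bool   # are frontier API calls affordable at this volume?


def choose_engine(doc: DocumentProfile) -> Engine:
    """Apply the three routing rules in order."""
    if doc.is_complex:
        return Engine.FRONTIER_VLM
    if doc.open_source_accuracy_ok:
        return Engine.OPEN_SOURCE_VLM
    if doc.frontier_within_budget:
        # Assumption: with no cost pressure, fall back to the frontier model.
        return Engine.FRONTIER_VLM
    return Engine.SPECIALISED_OCR


structured_form = DocumentProfile(
    is_complex=False, open_source_accuracy_ok=False, frontier_within_budget=False
)
print(choose_engine(structured_form))  # Engine.SPECIALISED_OCR
```

Keeping the decision in one function makes it easy to revisit the rules as open-source models improve or API prices change.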


