Beyond OCR: Using Multimodal AI to Extract Clean Data from Messy Docs

A quick recap of pros, cons, and lessons learned

I've been working across a dozen different industries over the last 12 months, each with its own challenges. But when it comes to AI, there's one use case that pops up again and again: value extraction from documents.

It's an almost eternal struggle. Every B2B business has to deal with incoming documents at some point. PDFs, Word files, or, as one exec recently told me on the side during a conference: "Excel spreadsheets printed out, crumpled up, and scanned in again a few times."

Traditional OCR technology has been around since the 1970s. And yes, most businesses are using some kind of OCR solution already. But many core challenges remain unresolved. New GenAI models promise to revolutionize this use case (again).

Today, let's unpack what's real, what's hype, and how you can actually start leveraging this technology effectively.

What is Multimodal Document Understanding?

Multimodal document understanding builds on AI models that can process text and images natively. Multimodality is a huge deal. As I wrote previously, image processing with LLMs went nuts (in a good way).

Why does that matter for document understanding?

Let’s break it down!

Traditional OCR treats documents as flat collections of characters. Aside from some basic layout heuristics, classical OCR doesn’t care what it’s reading. Its job is to extract characters, string them together, and hope they form something useful—leaving all the heavy lifting for downstream tools or manual review.

Multimodal models like GPT-4o, Gemini, and Claude, on the other hand, don’t just see characters—they interpret context. Instead of merely extracting text, they “look” at the document, which makes it simultaneously a computer vision and a natural language processing task. This allows them to process semantics, layout, and visual cues together. They acknowledge that a document is more than just text—it's an interplay of structure, content, and design working together to convey meaning.
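To make that concrete, here’s a minimal sketch of what “looking” at a document means in practice, assuming the OpenAI Python SDK, an API key in the environment, and a scanned page saved as invoice_page1.png (the file name and question are placeholders):

```python
# Minimal sketch: send a scanned page to a multimodal model and ask what it
# means, not just what characters it contains. File name and prompt are
# illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("invoice_page1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What kind of document is this, and what is the total amount due after tax?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```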

Here’s a quick comparison to make that tangible:

Traditional OCR → Multimodal AI:

  • Recognizes characters only → Understands text and visuals together

  • Ignores document structure → Interprets layout and relationships

  • Tables/forms need separate handling → Handles complex layouts natively

  • Relies on templates and zones → Template-free across document types

  • Outputs plain text → Extracts data and answers questions

Why This Matters Now

The timing for multimodal document understanding couldn't be better. Three streams are converging:

First, major AI labs have made significant breakthroughs in multimodal capabilities. OpenAI's GPT-4o, Google's Gemini, Anthropic's Claude, and xAI’s Grok all launched with impressive document processing abilities in the past year. Heck, there are even open-source LLMs with native image processing. These models represent a step-change in what's possible.

Second, businesses are drowning in documents more than ever. In many industries, digitalization hasn’t reduced the number of documents; it has actually increased it (e.g. by switching from paper to PDFs).

Third, implementation barriers have fallen dramatically. You no longer need specialized ML expertise to deploy these solutions. Anyone with a ChatGPT account can get started.

The question isn't whether to replace your current OCR systems, but rather where these new capabilities can add more value. The shift only makes sense when it delivers better results or lower costs. Ideally, both.

Key Benefits of Multimodal Document Understanding

A few key arguments make the case for multimodal document understanding:

✅ Holistic Document Comprehension

Multimodal AI understands documents as a whole rather than as disconnected elements. For example, if a checkbox marked in one section changes how another section should be interpreted, a multimodal AI can catch that in one go—something that would’ve required extensive rule-building with traditional tools. These models also take into account where information appears on the page—whether it’s in the header, footer, sidebar, or main body—to interpret it correctly and extract the right data. The same goes for visual hierarchies, like font sizes, bold text, and other formatting cues that signal importance.

✅ Complex Structure Handling

Multimodal models excel at scenarios that typically trip up traditional OCR systems:

  • Tables: Preserve row and column structure, even with merged cells or complex headers

  • Forms: Understand label-field relationships without manual mapping

  • Multi-column layouts: Follow text flow accurately across columns

  • Mixed content: Handle combinations of text, images, and graphics

Creating the form logic alone could take weeks in a traditional OCR project – and that's just for a single document type.
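As a rough illustration, here’s a sketch of table extraction using the same vision-call pattern as above; the helper, prompt, and JSON keys are assumptions you’d adapt to your own documents:

```python
# Sketch: extract a line-item table while preserving row/column structure.
# Prompt, file name, and JSON keys are illustrative assumptions.
import base64
import json
from openai import OpenAI

client = OpenAI()

def ask_image(prompt: str, image_path: str) -> str:
    """One vision call: a text prompt plus one page image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

prompt = (
    "Extract the line-item table from this document. Return only JSON: "
    "a list of objects with keys 'description', 'quantity', 'unit_price', 'total'. "
    "Keep values from merged cells attached to the correct row."
)
raw = ask_image(prompt, "invoice_page1.png")
rows = json.loads(raw)  # in practice, strip markdown fences before parsing
for row in rows:
    print(row["description"], row["total"])
```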

✅ Contextual Information Extraction

Beyond just recognizing text, multimodal models can pull specific data points based on how they’re described. For example, instead of relying on fixed coordinates to extract an invoice total, a multimodal model can understand phrases like “total amount due after tax” and locate the correct value – regardless of where it appears on the page or how it's labeled. This ability to leverage the model’s embedded knowledge turns document processing from a mechanical task into a semantic one.
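In practice, this often comes down to describing fields semantically instead of by position. A small sketch, with purely illustrative field names and descriptions that you’d send along with the page image:

```python
# Sketch: build an extraction prompt from semantic field descriptions instead
# of page coordinates. Field names and descriptions are assumptions.
fields = {
    "total_due": "the total amount due after tax, as a plain number without currency symbol",
    "due_date": "the payment due date in ISO format (YYYY-MM-DD)",
    "supplier_name": "the legal name of the company issuing the document",
}

prompt = (
    "Extract the following fields from the attached document. "
    "Return only a JSON object with exactly these keys:\n"
    + "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    + "\nUse null for any field that is not present."
)
print(prompt)
```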

The shift from “what text is here?” to “what does this document mean?” is where the true power of multimodal understanding comes into play.

Current Limitations and Challenges

As promising as this technology is, there are still a few caveats business leaders should keep in mind before ditching their old OCR system and jumping into the new Multimodal AI world:

❌ Lack of Bounding Boxes for Precise Element Location

This might sound silly but for many companies, missing bounding boxes are a serious dealbreaker. A large number of document workflows are still human-augmented—meaning a person reviews and validates the model’s suggestions. Now imagine reviewing a 30-page document and not being able to tell where exactly the model found a specific value. While it's relatively easy to identify the right page (since each can be processed individually), pinpointing the precise location within that page—especially for dense legal or financial text—isn’t currently possible out of the box.

Traditional OCR systems provide bounding boxes for every extracted element, which makes it easy to highlight and validate data in context. Multimodal models don’t natively support this yet. Google has introduced early approaches that return bounding boxes in Gemini, but it's still a work in progress.

[Image: Gemini bounding-box output, via Google]

In practice, many teams fall back to searching for the extracted result in the raw document text and then highlighting it using search-based logic—not ideal, and it adds a lot of complexity to an otherwise straightforward solution.
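A minimal sketch of that search-based fallback, assuming you have word-level boxes from a classic OCR pass (here via pytesseract) and the value the model returned:

```python
# Sketch: locate an LLM-extracted value on the page using word-level OCR
# boxes from pytesseract, so a reviewer can see where it came from.
# The matching is deliberately naive; real documents usually need fuzzy,
# multi-word matching.
import pytesseract
from PIL import Image

def find_value_boxes(image_path: str, value: str):
    """Return (left, top, width, height) boxes of OCR words matching `value`."""
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    target = value.replace(",", "").lower()
    boxes = []
    for i, word in enumerate(data["text"]):
        cleaned = word.replace(",", "").lower().strip()
        if cleaned and cleaned in target:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes

# Highlight where the value the model extracted ("1,234.56") sits on page 3.
print(find_value_boxes("invoice_page3.png", "1,234.56"))
```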

❌ Inconsistency with Slight Input Changes

Traditional OCR systems were fairly deterministic—either they worked, or they didn’t. They didn’t "make things up." But with LLMs, even small changes in input—like scanning the same document at a slightly different angle or resolution—can lead to different outputs. Sometimes the model interprets the content slightly differently, sometimes it introduces hallucinations.

One practical workaround is to introduce a second layer of AI for validation. For example, have another model check the output against expected patterns or business rules, or run multiple passes and compare results for consistency. But again, this adds complexity and cost.

Speaking of which…

❌ Cost and Speed Considerations

These models are computationally heavy and may not be the best fit for low-latency or high-volume real-time tasks. $2.50 per 1M input tokens doesn’t sound like much, but costs add up quickly if you process many high-resolution documents at scale.
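A quick back-of-envelope calculation makes this tangible; the tokens-per-page figure below is a rough assumption (image token counts vary by model and resolution), and the rate is the $2.50 per 1M input tokens mentioned above:

```python
# Back-of-envelope cost estimate under illustrative assumptions.
PRICE_PER_M_INPUT_TOKENS = 2.50   # USD, example rate from above
TOKENS_PER_PAGE = 1_000           # assumption; check your model's image pricing
PAGES_PER_DOC = 5
DOCS_PER_MONTH = 100_000

monthly_tokens = TOKENS_PER_PAGE * PAGES_PER_DOC * DOCS_PER_MONTH
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
print(f"~${monthly_cost:,.0f}/month in input tokens alone")  # ~$1,250 here
# Output tokens, retries, and validation passes come on top.
```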

On the other hand, many legacy OCR providers that charge on a per-document basis are often much more expensive than calling modern GenAI models.

❌ Data Privacy, Security, and Regulation

Uploading sensitive documents to 3rd party AI services might raise compliance concerns, especially in industries like finance, law, and healthcare. In particular, AI regulations such as the EU AI Act may impose stricter requirements on you for using these solutions, or even prohibit their use altogether.

For data security and privacy, deploying models on cloud platforms that you use anyway or hosting open-source models on-prem can be a viable alternative.
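For example, since many inference servers expose OpenAI-compatible endpoints, you can often keep the same client code and simply point it at your own infrastructure. The URL, key, and model name below are placeholders, and you’d need to verify that the hosted model actually handles images:

```python
# Sketch: same client, different endpoint – an on-prem, OpenAI-compatible
# server (e.g. vLLM or Ollama) hosting an open-source vision model.
# base_url, api_key, and model name are placeholders.
from openai import OpenAI

local_client = OpenAI(
    base_url="http://localhost:8000/v1",   # your internal inference server
    api_key="not-needed-locally",
)

resp = local_client.chat.completions.create(
    model="your-hosted-vision-model",      # placeholder model name
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(resp.choices[0].message.content)
```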

Best Practices for Implementation

While it’s too early to provide a definitive guide, I can see some effective patterns emerging. Consider these suggestions as pointers, not rules.

1. Break complex documents into step-by-step workflows
Instead of trying to extract all information at once, do it step by step. For example, have one AI call extract the date, another fetch the right price, and perhaps yet another extract metadata. Build an AI workflow that uses context from one process step as input for another. This way, you can adjust prompts by document types or include metadata dynamically into the prompt.

2. Enhance accuracy with multiple runs
To reduce hallucinations and improve extraction accuracy, you can run the same prompt multiple times and compare the outputs. This approach helps surface inconsistencies that might not be obvious from a single run. Depending on your quality requirements, you can either:

  • Use majority voting to select the most consistent result

  • Or apply a strict validation rule, such as returning an error if even a single value deviates across, say, 10 attempts

This kind of redundancy may sound excessive, but in high-stakes workflows—like financial reporting or compliance audits—it will give you the confidence you need.
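Here's what that redundancy can look like in code – a minimal sketch where `run_extraction` is a placeholder for whichever multimodal call you use:

```python
# Sketch: run the same extraction several times and only accept a consistent
# result, via majority voting or a strict all-must-agree rule.
import json
from collections import Counter

def run_extraction(image_path: str) -> dict:
    # Placeholder – replace with your actual multimodal extraction call.
    return {"total_due": "1234.56", "due_date": "2025-07-31"}

def extract_with_voting(image_path: str, runs: int = 5, strict: bool = False) -> dict:
    results = [run_extraction(image_path) for _ in range(runs)]
    # Canonicalize each result so runs can be counted and compared.
    canon = [json.dumps(r, sort_keys=True) for r in results]
    if strict:
        # Strict mode: any disagreement across runs is treated as an error.
        if len(set(canon)) > 1:
            raise ValueError("Extraction runs disagree – flag for human review")
        return results[0]
    # Majority voting: return the most frequent result.
    most_common, _ = Counter(canon).most_common(1)[0]
    return json.loads(most_common)

print(extract_with_voting("invoice_page1.png", runs=5))
```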

3. Structure your prompts thoughtfully
The usual principles for effective prompting apply here, of course. Make sure to give specific instructions (e.g. how to format the output) and include relevant details (e.g. which parts of the document to ignore – this can depend on document metadata).
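For illustration, a prompt along these lines – the field names, formatting rules, and “ignore” instructions are assumptions to adapt to your own document types:

```python
# Illustrative extraction prompt: specific instructions, explicit output
# format, and details on what to ignore. Everything here is an assumption.
EXTRACTION_PROMPT = """You are extracting data from a scanned supplier invoice.

Instructions:
- Return only valid JSON with the keys: invoice_number, total_due, currency, due_date.
- Format dates as YYYY-MM-DD and amounts as plain numbers without currency symbols.
- Ignore the footer, page numbers, and any marketing or boilerplate text.
- If a field is not present, set it to null – do not guess.
"""
```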

Conclusion

The line between unstructured and structured data is blurring, and organizations that embrace this shift early will be the ones who redefine what's possible in document processing and insight extraction.

Don’t throw away your legacy OCR systems just yet – try to enhance them using GenAI. In many workflows, there’s room for both. In any case, keep an eye on the cost! Especially if you’re working with legacy OCR solutions, shifting to a GenAI solution can be much cheaper at a similar accuracy rate.

TL;DR: Don’t use multimodal AI because it’s cool — use it because it actually drives business value.

See you next Friday,
Tobias
