
Why extracting data from PDFs is still a nightmare for data experts

    A robot sitting on a stack of books, reading a book.

    Credit: Kirillm via Getty Images

    However, according to recent tests, these promotional claims do not always match real-world performance. “I am usually a pretty big fan of the Mistral models, but the new OCR-specific one they released last week was really bad,” Willis noted.

    “A colleague sent me this PDF and asked if I could help him parse the table it contained,” says Willis. “It is an old document with a table that has a number of complex layout elements. The new [Mistral] OCR-specific model performed really badly, repeating the names of cities and botching many of the figures.”

    AI app developer Alexander Doria also recently pointed out on X a flaw in Mistral OCR's ability to understand handwriting, writing: “Unfortunately, Mistral OCR still has the usual VLM curse: with challenging manuscripts, it hallucinates completely.”

    According to Willis, Google currently leads the field in AI models that can read documents: “At the moment the clear leader is Google's Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral couldn't, with only a small number of mistakes, and I have successfully run several messy PDFs through it, including some with handwriting.”

    Gemini's performance is largely due to its ability to take in very long documents (into a kind of short-term memory called a “context window”), which Willis specifically notes as an important advantage: “The size of the context window also helps, because I can upload large documents and work through them in parts.” This capability, combined with more robust handling of handwritten content, apparently gives Google's model a practical lead over competitors in real-world document-processing tasks for now.
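The "work through them in parts" approach Willis describes can be sketched in a few lines. Below is a minimal, hypothetical illustration of splitting extracted document text into chunks that fit within a model's context window; the 4-characters-per-token ratio and the chunk size are illustrative assumptions, not properties of any particular model.

```python
def chunk_document(text: str, max_tokens: int = 8000, chars_per_token: int = 4) -> list[str]:
    """Split extracted document text into chunks of roughly max_tokens each,
    breaking on paragraph boundaries so no paragraph is cut mid-stream.
    (Token-size estimate is a rough assumption for illustration.)"""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would overflow the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent to the model in turn, with the caveat that tables or figures spanning a chunk boundary may still confuse the model.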

    The disadvantages of LLM-based OCR

    Despite their promise, LLMs introduce several new problems to document processing. Among them: they can produce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions embedded in the text (mistaking them for part of a user prompt), or simply misinterpret the data.
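The "accidentally follow instructions in the text" failure mode is a prompt-injection risk: a scanned document can contain text that reads like a command to the model. A naive, illustrative screen for such phrases might look like the sketch below; the pattern list is a hypothetical example, and real-world injection detection is considerably harder than this.

```python
import re

# Illustrative patterns only -- a real screen would need far more coverage.
INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) instructions",
    r"disregard .{0,40}(instructions|prompt)",
    r"you are now",
]

def flag_suspicious_lines(extracted_text: str) -> list[str]:
    """Return lines of OCR output that look like prompts aimed at the model."""
    flagged = []
    for line in extracted_text.splitlines():
        if any(re.search(p, line, re.IGNORECASE) for p in INJECTION_PATTERNS):
            flagged.append(line)
    return flagged
```

Flagged lines could be stripped or reviewed by a human before the extracted text is fed into any downstream prompt.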