Multipage documents
Our biggest issue with all OCR models so far is handling multi-page documents correctly. Ours are multi-page invoice documents with tables spanning multiple pages (some even have hundreds of pages). So far we have been getting around this with post-processing, but it's flaky at best. The issue is that when running inference one page at a time, there is no context from previous pages (with that context, a table continued from the previous page could, for example, be given its proper header row).
Some ideas that would be really nice in a future version to help solve this type of issue:
- multipage support (send in multiple page images per request)
- the ability to supply the previous page image along with the current page to OCR. The previous page would not be extracted and would only serve as additional context
- Similar to the previous-page idea above, but instead of supplying the previous page image, have the ability to provide the current extraction (e.g. all markdown from all pages extracted so far, concatenated together) along with an image of the current page to extract. This would greatly help where tables and such span multiple pages (see the sketch after this list).
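To make the third idea concrete, here is a minimal sketch of what such a request could look like, assuming the model is served behind a vLLM OpenAI-compatible endpoint; the base URL and model id are placeholders, and (per the reply below) extra input text may hurt rather than help:

```python
# Hypothetical sketch: pass the markdown extracted so far as text context
# alongside the current page image. Server URL and model id are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ocr_page_with_context(image_path: str, markdown_so_far: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="lightonai/LightOnOCR-1B",  # placeholder model id
        messages=[{
            "role": "user",
            "content": [
                # Idea 3: the extraction so far, concatenated, as context.
                {"type": "text", "text": markdown_so_far},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```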
I've tried a few of these during inference, but prompting seems to make LightOnOCR 2 pretty fragile.
Thanks for taking the time to consider these types of scenarios!
Some tests on v1 revealed that multi-page was working even without explicit multi-page training, but we didn't dig into this deeper for lack of benchmarks for this kind of task!
One important thing, LightOnOCR is NOT meant to be prompted; adding any input text will just degrade performance.
It would be interesting to test this out by sending multiple images with no prompt.
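For anyone who wants to try, a sketch of that no-prompt multi-image test under the same assumptions as above (vLLM OpenAI-compatible server, placeholder model id):

```python
# Sketch: previous and current page images in one request, no text part.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="lightonai/LightOnOCR-1B",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode('page_1.png')}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode('page_2.png')}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```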
I agree multi-page OCR is the biggest problem these days, for the reasons already explained by jondecker.
The problem he describes ("The issue is that when running inference one page at a time, there is no context from previous pages")
could be solved with an extra attention head that carries information across from the previous page to the new page.
That way those tables can be rendered normally.
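Purely as an illustration of that idea (not LightOnOCR's actual architecture), such a cross-page layer could look like the following: tokens from the current page attend over cached hidden states from the previous page, so table context survives the split.

```python
# Illustrative sketch only, not LightOnOCR's architecture: current-page
# tokens attend over cached hidden states from the previous page.
import torch
import torch.nn as nn

class CrossPageAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current, prev_cache=None):
        # current:    (batch, cur_len, dim) states of the page being OCR'd
        # prev_cache: (batch, prev_len, dim) cached states of the previous page
        if prev_cache is None:           # first page: nothing to attend to
            return current
        ctx, _ = self.attn(query=current, key=prev_cache, value=prev_cache)
        return self.norm(current + ctx)  # residual merge of cross-page context

# Each page's fused states become the cache for the next page.
layer = CrossPageAttention()
prev = None
for page_states in [torch.randn(1, 196, 512), torch.randn(1, 196, 512)]:
    fused = layer(page_states, prev)
    prev = fused.detach()
```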
That framing assumes pages stacked under each other (vertical page layouts).
But the same problem exists for horizontal page layouts (book style), where context is separated across a left and a right page; children's books, comic books, and language-learning books often have a book-style layout where information is split over the two facing pages.
Call this problem 1: context separated across horizontal or vertical pages.
And then we also have problem 2 (garbage data: headers and footers):
The output looks strange in those multi-page cases, especially when the model does not recognise the difference between the content of the page and its headers (logos, document title, etc.) and footers (disclaimers, page numbers, etc.).
When you OCR such documents you get weird artifacts like the header and footer appearing between pages 1 and 2, and between pages 2 and 3.
A good model should be able to ignore that noise (metadata).
If prompting has a negative impact on the OCR process, then maybe a separate model should be created that gets rid of headers and footers, while still having a model that extracts everything RAW.
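Short of a separate model, even a crude post-processing heuristic goes some way: lines that repeat near the top or bottom of most pages are probably headers or footers. A sketch (function name and thresholds are made up for illustration):

```python
# Hypothetical heuristic, not a model: drop lines that recur near page
# edges across most pages, leaving the RAW extraction otherwise intact.
from collections import Counter

def strip_repeated_edges(pages: list[str], edge: int = 3, min_ratio: float = 0.6) -> list[str]:
    counts: Counter = Counter()
    for page in pages:
        lines = page.splitlines()
        # Count each distinct non-empty line seen near a page edge once per page.
        for line in set(lines[:edge] + lines[-edge:]):
            if line.strip():
                counts[line] += 1
    threshold = max(2, int(len(pages) * min_ratio))
    noise = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        kept = [
            l for i, l in enumerate(lines)
            # Only drop a noisy line when it sits near a page edge, so a
            # legitimately repeated line in the body survives.
            if l not in noise or edge <= i < len(lines) - edge
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```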
Problem 2 could be the reason that problem 1 gets worse. Imagine you have a model capable of detecting a table split across pages: that capability could be broken if headers and footers sit in the middle of the table... this confuses every model and would block an AI's ability to reconcile the table (or other split data).
These are the struggles of OCR software and OCR models.