MinerU has a function to identify headers and footers, which it uses to decide what to extract. It's pretty neat: it saves a PDF with markings showing what is what. It still wouldn't do what you want without additional code, but perhaps this feature could be used to detect format changes, e.g. by making a decision based on the layout structure and text patterns?
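A rough sketch of that idea, in case it helps: reduce each page to a "layout signature" and flag pages where the signature changes. The block labels (`header`, `text`, `table`, `footer`) and the function names are made up for illustration. This is not MinerU's actual output format, so you would need to map whatever the tool gives you onto something like this first.

```python
from itertools import groupby


def layout_signature(block_types):
    """Collapse consecutive repeats: ['text', 'text', 'table'] -> ('text', 'table')."""
    return tuple(label for label, _ in groupby(block_types))


def find_format_changes(pages):
    """Return indices of pages whose layout signature differs from the previous page.

    pages: one list of block-type labels per page (hypothetical labels,
    standing in for whatever a layout-detection tool reports).
    """
    signatures = [layout_signature(p) for p in pages]
    return [i for i in range(1, len(signatures)) if signatures[i] != signatures[i - 1]]


pages = [
    ["header", "text", "text", "footer"],
    ["header", "text", "footer"],
    ["header", "table", "text", "footer"],   # layout switches to table-first here
    ["header", "table", "table", "text", "footer"],
]
print(find_format_changes(pages))
```

Collapsing repeats keeps a page with three paragraphs from looking different to a page with two, so only a real structural change gets flagged.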

Interesting. I actually did try passing it through Tesseract OCR first. Tesseract did a fine job of extracting the text, but the chatbot still gave poor results. One issue was that Tesseract pulls out all text from the PDF, including page headers and footers that should not be treated as actual content.
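For the header/footer problem specifically, one possible post-processing step is to drop lines that repeat at the top or bottom of most pages. This is only a heuristic sketch: it assumes you already have the OCR text split per page, and the function names, `edge_lines`, and `min_fraction` values are illustrative guesses that would need tuning on real documents.

```python
from collections import Counter
import re


def normalize(line):
    """Collapse whitespace and mask digit runs so 'Page 3' and 'Page 12' match."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_margins(pages, edge_lines=2, min_fraction=0.6):
    """Remove lines that recur at the top or bottom of most pages.

    pages: list of page texts, one string per page (e.g. OCR output).
    edge_lines: how many non-empty lines at each edge to inspect.
    min_fraction: share of pages a line must appear on to count as boilerplate.
    """
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count, per page, which normalized lines show up near the page edges.
    counts = Counter()
    for lines in split_pages:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({normalize(ln) for ln in edges})

    threshold = max(2, min_fraction * len(pages))
    boilerplate = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        kept = [
            ln for i, ln in enumerate(lines)
            if not (
                (i < edge_lines or i >= len(lines) - edge_lines)
                and normalize(ln) in boilerplate
            )
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Annual Report\nRevenue grew in Q1.\nDetails follow.\nPage 1",
    "ACME Annual Report\nCosts fell in Q2.\nMore details.\nPage 2",
    "ACME Annual Report\nOutlook is stable.\nEnd of report.\nPage 3",
]
for page in strip_repeated_margins(pages):
    print(page)
    print("---")
```

Only lines near the page edges are candidates for removal, so a sentence that happens to repeat in the body text is left alone.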


SimpleStacker


Unsexy AI Failures: The PDF That Broke ChatGPT

Scoresby

It might be worth looking at something like [Docling](https://github.com/docling-project/docling). We built a proof-of-concept AI chatbot a while back that used this OCR tool to pull the text out of the PDFs we have. Converting to plain text first is going to give much better results in the chatbot.