It might be worth looking at something like Docling. We built a proof-of-concept AI chatbot a while back that used this tool to pull the text out of our PDFs. Converting to plain text first is going to give much better results in the chatbot.
Interesting. I actually did try passing it through Tesseract OCR first. Tesseract extracted the text fine, but the chatbot still gave poor answers. One issue was that Tesseract pulls out all the text from the PDF, including page headers and footers that should not be treated as actual content.
MinerU can identify headers and footers, and it uses that to decide what to extract. It's pretty neat: it also saves a PDF annotated to show what each region is. It still wouldn't do what you want without additional code, but perhaps this feature could be used to detect format changes, e.g. by making a decision based on the layout structure and text patterns?
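For the text-pattern side of that idea, here is a rough sketch that doesn't depend on MinerU or Tesseract at all. It assumes you already have the OCR output as a list of pages, each a list of text lines, and it treats a line as a header or footer if (after replacing digits with a placeholder) it shows up near the top or bottom of most pages. The function names and the `edge_lines` / `min_fraction` thresholds are made up for illustration and would need tuning on real documents.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse whitespace and digits so 'Page 3 of 10' matches 'Page 4 of 10'."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def find_repeated_edges(pages, edge_lines=2, min_fraction=0.6):
    """Return normalized lines that recur at the top or bottom of most pages."""
    counts = Counter()
    for lines in pages:
        non_empty = [l for l in lines if l.strip()]
        edges = non_empty[:edge_lines] + non_empty[-edge_lines:]
        # Use a set so a line is counted at most once per page.
        counts.update({normalize(l) for l in edges})
    # Require at least two occurrences so a single page never matches itself.
    threshold = max(2, int(min_fraction * len(pages)))
    return {text for text, n in counts.items() if n >= threshold}


def strip_headers_footers(pages, edge_lines=2, min_fraction=0.6):
    """Drop repeated edge lines; body lines in the middle of a page are kept."""
    repeated = find_repeated_edges(pages, edge_lines, min_fraction)
    cleaned = []
    for lines in pages:
        non_empty = [l for l in lines if l.strip()]
        n = len(non_empty)
        kept = [
            l for i, l in enumerate(non_empty)
            if not ((i < edge_lines or i >= n - edge_lines)
                    and normalize(l) in repeated)
        ]
        cleaned.append(kept)
    return cleaned


# Hypothetical sample data standing in for per-page OCR output.
pages = [
    ["ACME Annual Report", "Intro text one", "More body", "Page 1 of 3"],
    ["ACME Annual Report", "Second page body", "Other body", "Page 2 of 3"],
    ["ACME Annual Report", "Third body", "Closing", "Page 3 of 3"],
]
cleaned = strip_headers_footers(pages)
```

The set of repeated lines could also double as a crude format-change signal: if the header/footer pattern found in one batch of pages stops matching in the next, the layout has probably changed. A text-only heuristic like this will miss headers that vary a lot from page to page, which is where the layout information from a tool like MinerU would help.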