111 sats \ 1 reply \ @SimpleStacker 3 Oct \ parent \ on: Unsexy AI Failures: The PDF That Broke ChatGPT AI
Interesting. I actually did try passing it through Tesseract OCR first. Tesseract extracted the text fine, but the chatbot still couldn't do a good job with it. One issue was that Tesseract pulls out all the text from the PDF, including page headers and footers that shouldn't be treated as actual content.
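For what it's worth, here is a rough sketch of how repeated headers and footers could be filtered out of OCR output before it reaches the chatbot. It assumes you already have one text string per page (however you got it), and the function name and parameters (`strip_repeated_edges`, `edge_lines`, `min_share`) are made up for illustration. It is plain Python, not part of Tesseract.

```python
import re
from collections import Counter


def strip_repeated_edges(pages, edge_lines=2, min_share=0.6):
    """Drop lines that recur in the top/bottom `edge_lines` of most pages.

    `pages` is a list of strings, one per page of OCR text (an assumption
    about how the text was collected, not something Tesseract enforces).
    """
    def norm(line):
        # Collapse digits so "Page 3" and "Page 4" count as the same line.
        return re.sub(r"\d+", "#", line.strip().lower())

    # Non-empty lines per page.
    split = [[l for l in p.splitlines() if l.strip()] for p in pages]

    # Count on how many pages each normalized edge line appears.
    counts = Counter()
    for lines in split:
        edges = {norm(l) for l in lines[:edge_lines] + lines[-edge_lines:]}
        counts.update(edges)

    threshold = min_share * len(pages)
    repeated = {key for key, c in counts.items() if c >= threshold}

    cleaned = []
    for lines in split:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and norm(line) in repeated:
                continue  # looks like a running header or footer
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned
```

This only catches headers and footers that repeat almost verbatim near the page edges. Anything that changes from page to page (chapter titles, for example) would slip through.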
MinerU has a function to identify headers and footers, which it uses to decide what to extract. It's pretty neat, since it saves a PDF with markings showing what is what. It still wouldn't do what you want without additional code, but perhaps this feature could be used to detect format changes, e.g. by making a decision based on the layout structure and text patterns?
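A minimal sketch of that idea, assuming you can get a list of block labels for each page (something like `"header"`, `"text"`, `"table"`). The labels and the input shape here are hypothetical and not MinerU's actual output format, so you would need to adapt this to whatever the tool really emits. It compares the mix of block types on consecutive pages and flags pages where the layout shifts sharply.

```python
from collections import Counter


def layout_change_points(pages, threshold=0.5, ignore=("header", "footer")):
    """Return indices of pages whose layout differs sharply from the previous page.

    `pages` is a list of lists of block-type labels, one list per page.
    The label names are illustrative assumptions.
    """
    def signature(blocks):
        # Count block types, skipping running headers and footers.
        return Counter(b for b in blocks if b not in ignore)

    def similarity(a, b):
        # Multiset Jaccard similarity: shared blocks / total blocks.
        union = sum((a | b).values())
        return sum((a & b).values()) / union if union else 1.0

    changes = []
    for i in range(1, len(pages)):
        prev_sig = signature(pages[i - 1])
        curr_sig = signature(pages[i])
        if similarity(prev_sig, curr_sig) < threshold:
            changes.append(i)
    return changes


pages = [
    ["header", "text", "text", "footer"],
    ["header", "text", "text", "text", "footer"],
    ["header", "table", "table", "footer"],
    ["header", "table", "footer"],
]
print(layout_change_points(pages))  # [2]
```

The threshold is a guess and would need tuning on real documents. Text patterns (say, a regex for section numbering) could be added to the signature in the same way.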