
  1. For my research, I recently had to take long PDFs that contained multiple documents smushed together into one PDF file (mostly letters and reports), and find the document boundaries. All the AI tools I tried did a pretty bad job at that, but it's something a human could have done easily.
  2. Check out AI's attempts to draw ASCII art: #1031420
111 sats \ 2 replies \ @kepford 3 Oct
It might be worth looking at something like Docling. We built a proof of concept AI chatbot a while back that used this OCR tool to pull the text out of the PDFs we have. Converting to plain text first is going to give much better results in the chatbot.
Interesting. I actually did try passing it through Tesseract OCR first. Tesseract did a fine job, but the chatbot still couldn't do a good job. One issue was that Tesseract pulls out all the text from the PDF, including page headers and footers that should not be treated as actual content.
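For what it's worth, the header/footer problem can sometimes be handled before the text ever reaches the chatbot. Below is a minimal sketch, assuming you already have one string of OCR text per page. The function name `strip_repeated_lines` and the parameters `edge` and `min_share` are made up for illustration and are not part of Tesseract. The idea is that a line which keeps reappearing at the top or bottom of most pages (with digits ignored, so "Page 1" and "Page 2" count as the same line) is probably a running header or footer.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, edge=2, min_share=0.6):
    """Drop lines that recur near the top/bottom of most pages.

    pages     -- list of strings, one per page of OCR text (assumed input)
    edge      -- how many lines at each end of a page to inspect
    min_share -- fraction of pages a line must appear on to be dropped
    """
    def norm(line):
        # Replace digit runs so "Page 1" and "Page 2" normalize alike.
        return re.sub(r"\d+", "#", line.strip().lower())

    # Non-empty lines for every page.
    split = [[l for l in p.splitlines() if l.strip()] for p in pages]

    # Count on how many pages each normalized edge line occurs.
    counts = Counter()
    for lines in split:
        counts.update({norm(l) for l in lines[:edge] + lines[-edge:]})

    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split:
        last = len(lines) - edge
        kept = [
            l for i, l in enumerate(lines)
            if not ((i < edge or i >= last) and norm(l) in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

This only looks at the first and last few lines of each page, so a sentence that happens to repeat in the body is left alone. It will miss headers that change from page to page, and on very short pages the "edge" covers the whole page, so treat it as a starting point rather than a fix.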
MinerU has a function to identify headers and footers, which it uses to decide what to extract. It's pretty neat, as it saves a PDF with markings showing what is what. It would still not do what you want without additional code, but perhaps this feature can be used to detect format changes. For example, you could make a decision based on the pattern of layout structure and text patterns?
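To make the "decide based on layout and text patterns" idea a bit more concrete, here is a rough sketch of the additional code, assuming per-page plain text as input. It does not use MinerU's output or API at all. The names `START_PATTERNS`, `printed_page_number` and `find_boundaries` are hypothetical, and the two signals (an opening phrase near the top of a page, or a printed page number that resets to 1) are just guesses at what might work for letters and reports.

```python
import re

# Hypothetical opening phrases; tune these to your own documents.
START_PATTERNS = [
    re.compile(r"^\s*dear\b", re.I),
    re.compile(r"^\s*(memorandum|report|re:|subject:)", re.I),
]
# A footer line that is only a number, optionally prefixed by "Page".
PAGE_NO = re.compile(r"^\s*(?:page\s+)?(\d+)\s*$", re.I)


def printed_page_number(text):
    """Return the page number printed in the last two lines, if any."""
    lines = [l for l in text.splitlines() if l.strip()]
    for line in lines[-2:]:
        match = PAGE_NO.match(line)
        if match:
            return int(match.group(1))
    return None


def find_boundaries(pages, head=5):
    """Return 0-based indices of pages that look like document starts."""
    starts = [0] if pages else []
    for i in range(1, len(pages)):
        lines = [l for l in pages[i].splitlines() if l.strip()]
        # Signal 1: an opening phrase within the first `head` lines.
        opens = any(p.search(l) for l in lines[:head] for p in START_PATTERNS)
        # Signal 2: the printed page number resets to 1.
        resets = printed_page_number(pages[i]) == 1
        if opens or resets:
            starts.append(i)
    return starts
```

Real layout information (font sizes, letterheads, margins) would likely be a stronger signal than text alone, so if a tool gives you labelled regions, those could replace the regexes here.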