Interesting.  I actually did try passing it through Tesseract OCR first.  Tesseract extracted the text fine, but the chatbot still couldn't do a good job with the result.  One issue was that Tesseract pulls out all the text from the PDF, including page headers and footers that should not be treated as actual content.
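
For what it's worth, here is a rough sketch of one way to filter out that header/footer noise after OCR. It is not something I actually ran on those PDFs, just an illustration. It assumes you already have the OCR output as one string per page, and that headers and footers are lines near the top or bottom of a page that repeat across most pages once digits (like page numbers) are masked. The function name `strip_repeated_lines` and the `edge` and `min_share` parameters are made up for this example.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, edge=2, min_share=0.6):
    """Drop lines near the top/bottom of each page that recur across pages.

    pages:     list of strings, one per page of OCR output (assumed input shape)
    edge:      how many lines at the top and bottom of a page to inspect
    min_share: fraction of pages a line must appear on to count as a header/footer
    """
    def key(line):
        # Mask digits so "Page 1" and "Page 2" are treated as the same line.
        return re.sub(r"\d+", "#", line)

    # Split each page into non-empty, stripped lines.
    split = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]

    # Count on how many pages each edge line (digit-masked) appears.
    counts = Counter()
    for lines in split:
        counts.update({key(ln) for ln in lines[:edge] + lines[-edge:]})

    # Require at least two pages so a one-off line is never removed.
    threshold = max(2, int(min_share * len(pages)))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split:
        kept = [
            ln
            for i, ln in enumerate(lines)
            if not ((i < edge or i >= len(lines) - edge) and key(ln) in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

A heuristic like this will miss headers that change from page to page (chapter titles, for example), and it can wrongly drop a real line that happens to repeat at the edge of many pages, so treat it as a starting point only.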


It might be worth looking at something like [Docling](https://github.com/docling-project/docling). We built a proof of concept AI chatbot a while back that used this OCR tool to pull the text out of the PDFs we have. Converting to plain text first is going to give much better results in the chatbot.

kepford

Unsexy AI Failures: The PDF That Broke ChatGPT

Scoresby

1. For my research, I recently had to take long PDFs that contained multiple documents smushed together into one PDF file (mostly letters and reports), and find the document boundaries.  All the AI tools I tried did a pretty bad job at that, but it's something a human could have done easily.

2. Check out AI's attempts to draw ASCII art: https://stacker.news/items/1031420