I built a hybrid pipeline that auto-detects whether a PDF is digital or scanned and routes it to the matching path:
Digital PDFs: text is parsed directly with layout-aware extraction.
Scanned PDFs: pages are rendered to images with Poppler, run through Tesseract OCR, then reassembled into a searchable PDF.
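
A rough sketch of how this routing could look in Python. Everything here is an assumption for illustration rather than the pipeline's actual code: the libraries (pypdf for text extraction and page assembly, pdf2image as the Poppler wrapper, pytesseract for Tesseract), the function names, and the detection heuristic (count embedded characters per page and compare against a threshold). Plain `extract_text()` stands in for the layout-aware extractor, which the description above does not name.

```python
import io


def classify_pdf(chars_per_page, min_chars=25, min_text_ratio=0.5):
    """Hypothetical heuristic: "digital" if enough pages have an embedded text layer."""
    if not chars_per_page:
        return "scanned"
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = text_pages / len(chars_per_page)
    return "digital" if ratio >= min_text_ratio else "scanned"


def count_chars_per_page(pdf_path):
    # Imported lazily so classify_pdf works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [len((page.extract_text() or "").strip()) for page in reader.pages]


def extract_digital(pdf_path):
    """Digital path: read the embedded text layer, one string per page."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def ocr_to_searchable_pdf(pdf_path, out_path, dpi=300):
    """Scanned path: render with Poppler, OCR with Tesseract, merge the pages."""
    from pdf2image import convert_from_path  # needs Poppler on the system
    from pypdf import PdfReader, PdfWriter
    import pytesseract  # needs the Tesseract binary on the system

    writer = PdfWriter()
    for image in convert_from_path(pdf_path, dpi=dpi):
        # Tesseract returns a one-page PDF (bytes) with an invisible text layer.
        page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
        writer.add_page(PdfReader(io.BytesIO(page_pdf)).pages[0])
    with open(out_path, "wb") as fh:
        writer.write(fh)


def process(pdf_path, out_path):
    """Route a PDF down the digital or scanned path."""
    kind = classify_pdf(count_chars_per_page(pdf_path))
    if kind == "digital":
        return kind, extract_digital(pdf_path)
    ocr_to_searchable_pdf(pdf_path, out_path)
    return kind, out_path
```

The thresholds (`min_chars`, `min_text_ratio`) are placeholders and would need tuning on real documents; mixed PDFs with both digital and scanned pages may be better served by classifying each page separately.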