PDF Extractors

PDFExtractor

Extracts text, images, and tables from a PDF and returns them as strings, bytes, and JSON respectively.
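As a rough illustration of those three output types, the sketch below models each piece of extracted content as a record with a content type and a payload. The `ExtractedContent` class, the `describe` helper, and the sample values are hypothetical and are not part of the extractor's API; they only show how a string, raw bytes, and a JSON document might sit side by side.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class ExtractedContent:
    """Hypothetical container for one piece of extractor output."""
    content_type: str        # MIME type describing the payload
    data: Union[str, bytes]  # str for text and JSON, bytes for images


def describe(item: ExtractedContent) -> str:
    """Return a short label for the payload: 'text', 'image' or 'table'."""
    if item.content_type == "application/json":
        json.loads(item.data)  # raises ValueError if the table is not valid JSON
        return "table"
    if isinstance(item.data, bytes):
        return "image"
    return "text"


# Made-up sample values standing in for real extractor output.
samples = [
    ExtractedContent("text/plain", "Quarterly report"),
    ExtractedContent("image/png", b"\x89PNG\r\n\x1a\n"),
    ExtractedContent("application/json", json.dumps({"rows": [["Q1", 10], ["Q2", 12]]})),
]
labels = [describe(s) for s in samples]
```

Keeping the content type next to the payload lets downstream code decide how to handle each piece without inspecting the bytes.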

OCRMyPdf

Extracts text content from image-based PDF files using OCRmyPDF.
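For readers who want to see what the underlying tool does, here is a minimal sketch that calls the `ocrmypdf` command-line program directly and reads back the recognized text from a sidecar file. It is not this extractor's own code: the function names and file paths are illustrative, and it assumes `ocrmypdf` and a Tesseract language pack are installed and on `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ocr_command(pdf_in: str, pdf_out: str, sidecar_txt: str,
                      language: str = "eng") -> List[str]:
    """Build an ocrmypdf command that also writes the OCR text to a sidecar file."""
    return [
        "ocrmypdf",
        "-l", language,            # Tesseract language code
        "--sidecar", sidecar_txt,  # plain-text copy of the recognized text
        pdf_in,
        pdf_out,
    ]


def ocr_pdf_to_text(pdf_in: str, pdf_out: str, sidecar_txt: str) -> str:
    """Run OCR on pdf_in and return the recognized text (needs ocrmypdf on PATH)."""
    subprocess.run(build_ocr_command(pdf_in, pdf_out, sidecar_txt), check=True)
    return Path(sidecar_txt).read_text(encoding="utf-8")
```

A call such as `ocr_pdf_to_text("scan.pdf", "scan_ocr.pdf", "scan.txt")` would produce a searchable PDF plus the plain text. By default OCRmyPDF refuses to process pages that already contain text, so mixed documents may need an option such as `--skip-text`.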

UnstructuredIO

This extractor uses unstructured.io to split a PDF document into pieces and returns each piece as separate plain-text content.
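The sketch below shows one way this kind of splitting can look when the open-source `unstructured` Python package is used directly. It assumes that package is installed with its PDF dependencies; the helper names are illustrative and not taken from this extractor.

```python
from typing import Iterable, List


def elements_to_text_chunks(elements: Iterable) -> List[str]:
    """Turn document elements into separate, non-empty plain-text strings."""
    chunks = []
    for element in elements:
        # Each unstructured element exposes its content through a .text attribute.
        text = (getattr(element, "text", "") or "").strip()
        if text:
            chunks.append(text)
    return chunks


def extract_pdf_chunks(path: str) -> List[str]:
    """Partition a PDF with unstructured and return one string per element."""
    # Imported here so the helper above stays usable without the library installed.
    from unstructured.partition.pdf import partition_pdf

    return elements_to_text_chunks(partition_pdf(filename=path))
```

Calling `extract_pdf_chunks("report.pdf")` would return a list with one entry per title, paragraph, list item, and so on, with empty elements dropped.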

LayoutLM Document QA

This is a fine-tuned version of the multi-modal LayoutLM model for the task of question answering on documents. It has been fine-tuned using both the SQuAD2.0 and DocVQA datasets.
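A model like this can typically be queried through the Hugging Face `transformers` document question answering pipeline. The sketch below is an assumption-laden example rather than this extractor's implementation: the model ID `impira/layoutlm-document-qa` is a guess based on the description above, the helper names are illustrative, and the pipeline needs `transformers`, `torch`, `pytesseract`, and Pillow installed.

```python
from typing import Dict, List, Union


def best_answer(results: Union[Dict, List[Dict]]) -> str:
    """Return the answer text with the highest confidence score."""
    if isinstance(results, dict):  # tolerate a single-result dict
        results = [results]
    if not results:
        return ""
    return max(results, key=lambda r: r["score"])["answer"]


def ask_document(image_path: str, question: str) -> str:
    """Ask a question about a document image, for example a rendered PDF page."""
    # Imported lazily so best_answer() works without the heavy dependencies.
    from transformers import pipeline

    qa = pipeline(
        "document-question-answering",
        model="impira/layoutlm-document-qa",  # assumed model ID, not stated in this document
    )
    return best_answer(qa(image=image_path, question=question))
```

For instance, `ask_document("invoice_page1.png", "What is the invoice number?")` would return the highest-scoring answer span found on that page. The pipeline works on images, so PDF pages need to be rendered to images first.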

Marker Extractor

The Marker extractor converts PDF, EPUB, and MOBI files to Markdown. It is 10x faster than Nougat, more accurate on most documents, and has a low risk of hallucination.