ИнфоРост
information technology for archives and libraries

PDF

Fully automated PDF data extraction software
Automated PDF data extraction solutions come in different flavours, ranging from simple OCR tools to enterprise-ready document processing and workflow automation platforms. Most systems, however, share a similar workflow:

1. Assemble batches of sample documents to act as training data
2. Train the system for each type of document you want to process
3. Set up a process to automatically fetch documents, process them and dispatch the data
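
The three steps above can be sketched roughly as follows. This is only an illustration, not how any particular product works: the `TEMPLATES` dictionary stands in for the "trained" system (here the rules are simply written by hand), and `fetch_inbox`, `process` and `run_pipeline` are hypothetical names. The documents are assumed to be already converted to plain text.

```python
import re

# Step 2 stand-in: one hand-written rule set per document type.
# A real system would derive these rules from the sample batches of step 1.
TEMPLATES = {
    "invoice": {
        "marker": "INVOICE",
        "fields": {
            "number": r"Invoice No\.\s*(\S+)",
            "total": r"Total:\s*([\d.]+)",
        },
    },
    "receipt": {
        "marker": "RECEIPT",
        "fields": {"total": r"Amount paid:\s*([\d.]+)"},
    },
}


def classify(text):
    """Return the first document type whose marker occurs in the text, else None."""
    for doc_type, template in TEMPLATES.items():
        if template["marker"] in text:
            return doc_type
    return None


def process(text):
    """Extract the fields defined for the detected document type."""
    doc_type = classify(text)
    if doc_type is None:
        return None
    record = {"type": doc_type}
    for name, pattern in TEMPLATES[doc_type]["fields"].items():
        match = re.search(pattern, text)
        record[name] = match.group(1) if match else None
    return record


def run_pipeline(fetch, dispatch):
    """Step 3: fetch documents, process each one, dispatch the extracted data."""
    for text in fetch():
        record = process(text)
        if record is not None:
            dispatch(record)


def fetch_inbox():
    # Hypothetical source; in practice a watched folder or a mailbox.
    return [
        "INVOICE\nInvoice No. A-17\nTotal: 120.50",
        "RECEIPT\nAmount paid: 9.99",
        "unrelated letter",
    ]


results = []
run_pipeline(fetch_inbox, results.append)
print(results)
```

Passing `fetch` and `dispatch` in as functions keeps the extraction step independent of where documents come from and where the data goes, so either end can be swapped without touching the rules.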

Most advanced solutions combine several techniques to train the data extraction system. A simple method is Zonal OCR, in which the user defines specific locations inside the document with a point-and-click interface. More advanced techniques are based on regular expressions and pattern recognition.
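
To make the difference between the two approaches concrete, here is a minimal sketch. It assumes that an OCR or layout-aware text extraction step has already produced words with bounding boxes; the `Word` class, the `words_in_zone` helper, the sample coordinates and the zone rectangle are all made up for illustration. Coordinates are assumed to start at the top-left corner of the page with y growing downward.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One recognised word and its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def words_in_zone(words, zone):
    """Zonal extraction: join the words whose boxes lie fully inside the zone."""
    zx0, zy0, zx1, zy1 = zone
    inside = [
        w for w in words
        if w.x0 >= zx0 and w.y0 >= zy0 and w.x1 <= zx1 and w.y1 <= zy1
    ]
    # Naive reading order: top to bottom, then left to right.
    # Real pages usually need some tolerance when grouping words into lines.
    inside.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in inside)


# Sample page content, invented for this example.
words = [
    Word("Invoice", 50, 40, 100, 52),
    Word("No.", 105, 40, 125, 52),
    Word("A-17", 130, 40, 160, 52),
    Word("Date:", 400, 40, 430, 52),
    Word("2024-03-01", 435, 40, 500, 52),
    Word("Total:", 400, 700, 440, 712),
    Word("120.50", 445, 700, 490, 712),
]

# Zonal approach: the user has marked the top-right corner as the date field.
DATE_ZONE = (390, 30, 510, 60)
zone_text = words_in_zone(words, DATE_ZONE)
date = re.search(r"\d{4}-\d{2}-\d{2}", zone_text).group(0)

# Pattern approach: no position is needed, only the label before the value.
full_text = " ".join(w.text for w in words)
total = re.search(r"Total:\s*([\d.]+)", full_text).group(1)

print(zone_text, date, total)
```

The zonal rule breaks if the field moves on the page, while the regular expression breaks if the label changes, which is one reason the two are often combined.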

  • Layout-aware text extraction from full-text PDF of scientific articles
  • PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
  • Pdf2text.java