Information technologies for archives and libraries


2015-August. Possibilities in OCR and server automation:

    1. Output: a single raw-text file of OCR output (including text recognized in images)

    2. Zoning: the original document is automatically split into sections (articles and columns in newspapers; images, tables, and text)

    3. Structuring: automatic separation of fields: author, title, main text, etc.

    4. Indexing: automatic selection of index terms that supplement the main OCRed text: names of people and companies, product names, geographic locations, etc.

    5. Data mining: more advanced analysis and processing of OCRed text to establish dependencies, patterns, etc.

    + API
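
A minimal sketch of how levels 1-4 above could be chained, assuming very naive rules that are not from these notes: zones are blank-line-separated blocks, the first two zones are taken as title and author, and index terms are runs of two or more capitalized words. The class and function names (OcrDocument, zone, structure, index) are hypothetical; a real service would use layout analysis and named-entity recognition instead.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OcrDocument:
    raw_text: str                                     # level 1: raw OCR output
    zones: list = field(default_factory=list)         # level 2: zoning
    fields: dict = field(default_factory=dict)        # level 3: structuring
    index_terms: list = field(default_factory=list)   # level 4: indexing


def zone(doc):
    # Assumption: a blank line separates two zones.
    blocks = re.split(r"\n\s*\n", doc.raw_text)
    doc.zones = [b.strip() for b in blocks if b.strip()]
    return doc


def structure(doc):
    # Assumption: zone 1 is the title, zone 2 the author, the rest is main text.
    doc.fields = {
        "title": doc.zones[0] if doc.zones else "",
        "author": doc.zones[1] if len(doc.zones) > 1 else "",
        "text": "\n\n".join(doc.zones[2:]),
    }
    return doc


def index(doc):
    # Assumption: two or more consecutive capitalized words form a candidate name.
    pattern = r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b"
    doc.index_terms = sorted(set(re.findall(pattern, doc.fields["text"])))
    return doc


raw = (
    "Tobacco Sales Report\n\n"
    "John Smith\n\n"
    "Sales in New York rose. John Smith visited Acme Corp in Chicago."
)
doc = index(structure(zone(OcrDocument(raw))))
print(doc.fields["title"], doc.index_terms)
```

Single-word names such as "Chicago" are missed by this rule, which is one reason level 4 needs more than pattern matching in practice.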

Additional ideas:

   - provide a storage and keyword-search service over the OCRed text: an on-the-fly database of semi-structured data, plus images of the original documents. Tobaccodocuments can serve as an example.

   - provide stable URLs to documents;

    + possible back office for manual data structuring and metadata assignment
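
The storage and keyword-search idea can be sketched with an in-memory inverted index. This is only an illustration under assumptions: the OcrStore class, the BASE_URL value, and the document ids are hypothetical, and a real service would persist the data and likely use a full-text search engine.

```python
import re
from collections import defaultdict

BASE_URL = "https://example.org/docs/"  # hypothetical base for stable URLs


class OcrStore:
    """Toy store: OCRed text plus a path to the original image, searchable by keyword."""

    def __init__(self):
        self.docs = {}                  # doc_id -> {"text": ..., "image": ...}
        self.index = defaultdict(set)   # lowercase word -> set of doc_ids

    def add(self, doc_id, text, image_path=None):
        self.docs[doc_id] = {"text": text, "image": image_path}
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(doc_id)

    def url(self, doc_id):
        # The URL depends only on the document id, so it stays stable.
        return BASE_URL + doc_id

    def search(self, query):
        # Return URLs of documents that contain every word of the query.
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return sorted(self.url(doc_id) for doc_id in hits)


store = OcrStore()
store.add("doc-1", "Tobacco sales report 1998", image_path="scans/doc-1.png")
store.add("doc-2", "Annual report on sales")
print(store.search("sales report"))
print(store.search("tobacco"))
```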

2015-July: Google Docs free OCR test

Comparison of OCR results, in points*

OCR variant                                           Resolution  Appendix  Table
FineReader 1. ЧБ&Машинка.docx (B&W, typewriter)       3.25        2.6       3.7
FineReader 2. ЧБ&авто.docx (B&W, auto)                3.25        2         3.4
FineReader 3. Цвет&Машинка.docx (color, typewriter)   4           2         3.1
FineReader 4. Цвет&авто.docx (color, auto)            3.75        2         3.3
Tesseract                                             2.75        2         2
FineReader Online service                             3.75        2         3.6







* excellent - 5; good - 4; satisfactory - 3; unsatisfactory - 2.
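
One rough way to compare the variants in the table is an unweighted mean of the three scores. The equal weighting is an assumption made here for illustration, not a ranking stated in these notes, and the variant labels are shortened from the table rows.

```python
from statistics import mean

# Scores copied from the table: (resolution, appendix, table).
scores = {
    "FineReader 1": (3.25, 2.6, 3.7),
    "FineReader 2": (3.25, 2, 3.4),
    "FineReader 3": (4, 2, 3.1),
    "FineReader 4": (3.75, 2, 3.3),
    "Tesseract": (2.75, 2, 2),
    "FineReader Online": (3.75, 2, 3.6),
}

# Assumption: all three test documents count equally.
averages = {name: round(mean(vals), 2) for name, vals in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for name in ranking:
    print(f"{name:<20}{averages[name]:.2f}")
```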


OCR results for the test documents
