OCR
2015-August. Possibilities in OCR and server automation:
1. Output: a single raw-text file of the OCRed content (including text recognized from images)
2. Zoning: the original document is automatically split into sections (articles or columns in newspapers; images, tables, text)
3. Structuring: automatic separation of fields: author, title, main text, etc.
4. Indexing: automated selection of index terms that supplement the main OCRed text: people and company names, product names, geographic locations, etc.
5. Data mining: more advanced analysis and processing of the OCRed text to establish dependencies, patterns, etc.
+ API
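Steps 3 and 4 above can be sketched roughly as follows. This is only an illustration under strong assumptions: the `StructuredDoc` container, the "first line is the title, a line starting with `By ` is the author" rule, and the capitalized-word heuristic for index terms are all hypothetical and not taken from any particular OCR product.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructuredDoc:
    """Hypothetical container for the output of steps 1, 3 and 4."""
    raw_text: str                                          # step 1: raw OCR output
    title: str = ""                                        # step 3: separated fields
    author: str = ""
    body: str = ""
    index_terms: List[str] = field(default_factory=list)   # step 4


def structure(raw_text: str) -> StructuredDoc:
    """Naive structuring: the first non-empty line is the title, the first
    line starting with 'By ' is the author, everything else is the body."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    doc = StructuredDoc(raw_text=raw_text)
    if not lines:
        return doc
    doc.title = lines[0]
    body = []
    for ln in lines[1:]:
        if not doc.author and ln.lower().startswith("by "):
            doc.author = ln[3:].strip()
        else:
            body.append(ln)
    doc.body = "\n".join(body)
    return doc


# Runs of two or more capitalized words, e.g. "Acme Corp" or "New York".
_NAME_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def index_terms(text: str) -> List[str]:
    """Naive indexing: capitalized multi-word runs, deduplicated,
    in order of first appearance."""
    seen = {}
    for match in _NAME_RUN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


if __name__ == "__main__":
    sample = "Annual Report\nBy Jane Doe\nThe meeting with Acme Corp took place in New York."
    doc = structure(sample)
    doc.index_terms = index_terms(doc.body)
    print(doc.title, "|", doc.author, "|", doc.index_terms)
```

A real indexing step would more likely rely on a named-entity recognizer than on a regular expression; the heuristic here only shows where such a step would plug in.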
Additional ideas:
- provide a storage and keyword-search service over the OCRed text: an on-the-fly database of semi-structured data, plus images of the original documents. Tobaccodocuments can serve as an example.
- provide stable URLs to the documents;
+ possible back office for manual data structuring and metadata assignment
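The storage, keyword search and stable URL ideas could be prototyped along these lines. This is a minimal in-memory sketch, not a design: `OcrStore`, `stable_id`, `tokenize` and `BASE_URL` are hypothetical names, the URL is a placeholder, and deriving the ID from a content hash is just one assumed way to keep URLs stable.

```python
import hashlib
import re
from collections import defaultdict

BASE_URL = "https://example.org/docs/"  # placeholder, not a real service


def tokenize(text):
    """Lowercase word tokens; \\w is Unicode-aware, so Cyrillic text works too."""
    return re.findall(r"\w+", text.lower())


def stable_id(raw_text):
    """Derive the ID from the content, so the same text always gets the same URL."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()[:16]


class OcrStore:
    """In-memory store: OCRed text, a pointer to the original image,
    and an inverted index for keyword search."""

    def __init__(self):
        self.docs = {}                    # doc_id -> {"text": ..., "image": ...}
        self.index = defaultdict(set)     # token -> set of doc_ids

    def add(self, raw_text, image_path=None):
        """Store a document and return its stable URL."""
        doc_id = stable_id(raw_text)
        self.docs[doc_id] = {"text": raw_text, "image": image_path}
        for token in tokenize(raw_text):
            self.index[token].add(doc_id)
        return BASE_URL + doc_id

    def search(self, query):
        """Return sorted URLs of documents containing every query keyword."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in tokens))
        return sorted(BASE_URL + doc_id for doc_id in hits)


if __name__ == "__main__":
    store = OcrStore()
    url = store.add("Tobacco advertising report 1985", image_path="scan_001.png")
    print(url)
    print(store.search("tobacco report"))
```

Note that a content hash changes if the document is re-OCRed with different output, so a production service would probably assign IDs at upload time instead.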
2015-July: Google Docs free OCR test
Comparison of OCR results, in points*
OCR variant | Resolution | Appendix | Table |
---|---|---|---|
FineReader 1. B&W & Typewriter.docx | 3.25 | 2.6 | 3.7 |
FineReader 2. B&W & auto.docx | 3.25 | 2 | 3.4 |
FineReader 3. Color & Typewriter.docx | 4 | 2 | 3.1 |
FineReader 4. Color & auto.docx | 3.75 | 2 | 3.3 |
Tesseract | 2.75 | 2 | 2 |
FineReader Online service | 3.75 | 2 | 3.6 |
* excellent: 5; good: 4; satisfactory: 3; unsatisfactory: 2.
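To compare the variants at a glance, the three per-document scores in the table above can be averaged and ranked. The unweighted mean is an assumption made here for illustration; the table itself only gives the individual scores, and the short labels in the code are abbreviations of the row names.

```python
# Scores copied from the table above: (first, second, third score column).
SCORES = {
    "FineReader 1": (3.25, 2.6, 3.7),
    "FineReader 2": (3.25, 2, 3.4),
    "FineReader 3": (4, 2, 3.1),
    "FineReader 4": (3.75, 2, 3.3),
    "Tesseract": (2.75, 2, 2),
    "FineReader Online service": (3.75, 2, 3.6),
}


def mean_scores(scores):
    """Unweighted mean of the scores per OCR variant, rounded to 2 decimals."""
    return {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}


def ranking(scores):
    """Variants sorted from highest to lowest mean score."""
    return sorted(mean_scores(scores).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, mean in ranking(SCORES):
        print(f"{name}: {mean}")
```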
____________________________________________________________________________________________________
OCR results for the test documents
# | Name |
---|---|
1 | Resolution |
2 | Appendix |
3 | Table |
4 | Handwritten text |
Sources