How to extract formatted text from images
Summary
Google launched an API for vision-related tasks some time ago. Its text extraction API works really well, but it has some limitations when extracting formatted text such as tables. In this post I explain why I developed a simple postprocessing program to fix these issues, making it possible to extract formatted text like the kind found in images of spreadsheet tables and documents. The last section explains how to use it.
The code can be found here.
Motivation
Google OCR provides a text output that might not have the expected format. For that case it also provides a JSON output with the position of each recognized entity. The problem is that this data is not well structured for some tasks: extracting tokens (series of characters with no spaces between them) is not easy, since the JSON doesn't provide this information in a straightforward way. That's why I built a small program that postprocesses this data into something more manageable and better suited to text processing tasks like formatting text, extracting full lines of text or filtering words.
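To make the idea of a token concrete, here is a minimal sketch of merging neighbouring recognized pieces into tokens. It is not the code from the repository: the Box structure, the merge_into_tokens function and the max_gap threshold are names I am assuming for illustration, and the boxes are assumed to arrive in reading order with pixel coordinates already extracted from the JSON.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A recognized piece of text with its bounding box (hypothetical structure)."""
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def same_row(a, b):
    # Two boxes are on the same row when their vertical ranges overlap.
    return min(a.y_max, b.y_max) > max(a.y_min, b.y_min)


def merge_into_tokens(boxes, max_gap=3.0):
    """Merge consecutive boxes separated by at most max_gap pixels.

    Assumes the boxes are already in reading order.
    """
    tokens = []
    for box in boxes:
        last = tokens[-1] if tokens else None
        if (last is not None and same_row(last, box)
                and abs(box.x_min - last.x_max) <= max_gap):
            # No visible space between them: extend the previous token.
            tokens[-1] = Box(
                last.text + box.text,
                last.x_min,
                max(last.x_max, box.x_max),
                min(last.y_min, box.y_min),
                max(last.y_max, box.y_max),
            )
        else:
            tokens.append(box)
    return tokens


boxes = [
    Box("$", 10, 16, 5, 15),
    Box("162,700", 17, 60, 5, 15),
    Box("Tokyo", 90, 130, 5, 15),
]
print([t.text for t in merge_into_tokens(boxes)])  # ['$162,700', 'Tokyo']
```

The max_gap value depends on the image resolution and font size, so it would need tuning for real inputs.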
The postprocessing code is provided at src/image2tokens.py. It extracts tokens first and then builds more abstract concepts on top of them, like text lines or table columns.
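Going from tokens to text lines could look roughly like the sketch below. Again, this is only an illustration under my own assumptions, not the implementation in src/image2tokens.py: the Token structure and the group_into_lines function are hypothetical names, and a token is considered part of a line when its vertical center falls inside the vertical span of the first token of that line.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """A token with the coordinates needed for line grouping (hypothetical)."""
    text: str
    x_min: float
    y_min: float
    y_max: float


def group_into_lines(tokens):
    """Group tokens into lines, top to bottom, each line sorted left to right."""
    def center(t):
        return (t.y_min + t.y_max) / 2

    lines = []
    for token in sorted(tokens, key=center):
        # Compare against the first token of the most recent line.
        if lines and lines[-1][0].y_min <= center(token) <= lines[-1][0].y_max:
            lines[-1].append(token)
        else:
            lines.append([token])
    return [sorted(line, key=lambda t: t.x_min) for line in lines]


tokens = [
    Token("Tokyo", 90, 6, 16),
    Token("Accountant", 10, 5, 15),
    Token("London", 90, 25, 35),
    Token("CEO", 10, 24, 34),
]
for line in group_into_lines(tokens):
    print(" ".join(t.text for t in line))
# Accountant Tokyo
# CEO London
```

This simple rule works for roughly horizontal text; skewed or rotated images would need a more tolerant comparison.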
Usage
Requirements
- python 3
- python libraries (try something like: pip install google-cloud-vision)
  - google.cloud.vision
  - google.protobuf
  - google.oauth2
How to run it
In the src folder there is a usage example, table_example.py, where the tokenization is used to parse the image of a table.
python src/table_example.py sample.png
Usage Sample
Input
Output
HR Information                             Contact
Position                       Salary      Office         Extn.
Accountant                     $162,700    Tokyo          5407
Chief Executive Officer (CEO)  $1,200,000  London         5797
Junior Technical Author        $86,000     San Francisco  1562
Software Engineer              $132,000    London         2558
Software Engineer              $206,850    San Francisco  1314
Integration Specialist         $372,000    New York       4804
Software Engineer              $163,500    London         6222
Pre-Sales Support              $106,450    New York       8330
Sales Assistant                $145,600    New York       3990
Senior Javascript Developer    $433,060    Edinburgh      6224
Hi, I am getting table data without formatting from an image using Python. What should I do to get properly formatted data? If anyone knows of a source or references, please suggest them.
Maybe this: https://aws.amazon.com/es/textract/
Hi, I would like to get table data from an image using Java. Could you let me know of any source or way to do that? If anyone knows of references, please suggest them.
Maybe this: https://github.com/tabulapdf/tabula-java
Hi Mathis Gatti, you have done awesome work. I would like to know how I can fetch data based on the header, so I can skip other extracted data. I am working with bank statements and facing format variations.
Hi! You could try identifying header tokens based on their size, since header tokens usually have a greater height. You might also use specific words that usually appear in those headers to identify them. After that, you can check which tokens are aligned with the header columns to extract the whole table.
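A rough sketch of that idea, in case it helps. None of this comes from the repository: the Token structure, the HEADER_WORDS set, the height_ratio threshold and both functions are made up for illustration, and you would need to adapt them to your statements.

```python
from statistics import median
from typing import NamedTuple


class Token(NamedTuple):
    """A token with its bounding box (hypothetical structure)."""
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


# Hypothetical keywords; adjust them to the statements you process.
HEADER_WORDS = {"date", "description", "amount", "balance"}


def find_headers(tokens, height_ratio=1.3):
    """Pick tokens that match a header keyword or are clearly taller than usual."""
    typical = median(t.y_max - t.y_min for t in tokens)
    return [
        t for t in tokens
        if t.text.lower() in HEADER_WORDS
        or (t.y_max - t.y_min) >= height_ratio * typical
    ]


def assign_to_columns(tokens, headers):
    """Put each non-header token under the header it sits below and overlaps with."""
    columns = {h.text: [] for h in headers}
    body = [t for t in tokens if t not in headers]
    for t in sorted(body, key=lambda t: t.y_min):
        for h in headers:
            below = t.y_min >= h.y_max
            aligned = t.x_min < h.x_max and t.x_max > h.x_min
            if below and aligned:
                columns[h.text].append(t.text)
                break
    return columns


tokens = [
    Token("Date", 10, 45, 0, 14),
    Token("Amount", 100, 150, 0, 14),
    Token("01/02", 10, 40, 20, 30),
    Token("$50.00", 100, 140, 20, 30),
    Token("03/02", 10, 40, 40, 50),
    Token("$75.10", 100, 140, 40, 50),
]
headers = find_headers(tokens)
print(assign_to_columns(tokens, headers))
# {'Date': ['01/02', '03/02'], 'Amount': ['$50.00', '$75.10']}
```

Tokens that are not below any header, or not aligned with one, are simply dropped, which is one way to skip the other extracted data around the table.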