How to extract formatted text from images

Published Oct 04, 2019. Last updated Oct 08, 2019.

Summary

Google launched an API for vision-related tasks some time ago. Its text extraction works really well, but it has some limitations when extracting formatted text such as tables. In this post I briefly explain why I developed a simple postprocessing program to fix these issues, making it possible to extract formatted text like the kind found in images of spreadsheet tables and documents. The last section explains how to use it.

The code can be found here.

Motivation

Google OCR provides a text output, which might not have the expected format. For that case it also provides a JSON output with information about the position of each recognized entity. The problem is that this data is not well structured for some tasks: extracting tokens (series of characters with no spaces between them) is not easy, because the JSON doesn't provide this information in a straightforward way. That's why I built a small program that postprocesses this data into something more manageable and better suited to text processing tasks such as formatting text, extracting full lines of text, or filtering words.

The postprocessing code is provided at src/image2tokens.py. It extracts tokens first and then builds more abstract concepts on top of them, such as text lines or table columns.
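To make the idea of a token more concrete, here is a minimal sketch of one way characters could be merged into tokens. It is not the code in src/image2tokens.py. It assumes each recognized character arrives in reading order with a bounding box, and the names Box, merge_into_tokens and max_gap are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """A piece of recognized text with its bounding box (hypothetical layout)."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def same_line(a: Box, b: Box) -> bool:
    """Two boxes share a line if their vertical ranges overlap."""
    return min(a.y_max, b.y_max) - max(a.y_min, b.y_min) > 0


def merge_into_tokens(symbols: List[Box], max_gap: float = 3.0) -> List[Box]:
    """Merge consecutive symbols into tokens.

    Assumes the symbols are already in reading order. A symbol joins the
    previous token when both sit on the same line and the horizontal gap
    between them is at most max_gap (an assumed threshold, in pixels).
    """
    tokens: List[Box] = []
    for sym in symbols:
        if tokens:
            last = tokens[-1]
            gap = sym.x_min - last.x_max
            if same_line(last, sym) and abs(gap) <= max_gap:
                # Grow the previous token so it also covers this symbol.
                tokens[-1] = Box(
                    last.text + sym.text,
                    min(last.x_min, sym.x_min),
                    min(last.y_min, sym.y_min),
                    max(last.x_max, sym.x_max),
                    max(last.y_max, sym.y_max),
                )
                continue
        tokens.append(sym)
    return tokens


symbols = [
    Box("$", 10, 10, 16, 20), Box("8", 17, 10, 23, 20), Box("6", 24, 10, 30, 20),
    Box("T", 80, 10, 86, 20), Box("o", 87, 12, 93, 20),
    Box("N", 10, 40, 16, 50),
]
tokens = merge_into_tokens(symbols)
print([t.text for t in tokens])
```

A fixed pixel threshold is the simplest choice here. A threshold relative to the character height would probably cope better with images of different resolutions.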

Usage

Requirements

Python 3
Python libraries (try something like: pip install google-cloud-vision)
- google.cloud.vision
- google.protobuf
- google.oauth2

How to run it

In the src folder there is a usage example, table_example.py, where the tokenization is used to parse the image of a table.

python src/table_example.py sample.png
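As a rough illustration of what parsing a table from tokens can involve, the sketch below groups tokens into rows by vertical position and prints each row in right-aligned, fixed-width columns. It is not the contents of table_example.py. The (text, x_center, y_center) tuple layout, the function names and the y_tol and width parameters are all assumptions made for this example, and it treats every cell as a single token.

```python
from typing import List, Tuple

# (text, x_center, y_center): a hypothetical token layout for this sketch.
Token = Tuple[str, float, float]


def group_rows(tokens: List[Token], y_tol: float = 8.0) -> List[List[str]]:
    """Group tokens into rows, then order each row from left to right.

    A token joins the current row when its vertical center is within
    y_tol of the first token of that row.
    """
    rows: List[List[Token]] = []
    for tok in sorted(tokens, key=lambda t: t[2]):
        if rows and abs(tok[2] - rows[-1][0][2]) <= y_tol:
            rows[-1].append(tok)
        else:
            rows.append([tok])
    return [[t[0] for t in sorted(row, key=lambda t: t[1])] for row in rows]


def render(rows: List[List[str]], width: int = 40) -> str:
    """Right-align every cell in a column of the given width."""
    return "\n".join("".join(cell.rjust(width) for cell in row) for row in rows)


tokens = [
    ("Tokyo", 300, 52), ("Accountant", 100, 50), ("$162,700", 200, 51),
    ("Salary", 200, 20), ("Position", 100, 21), ("Office", 300, 20),
]
rows = group_rows(tokens)
table = render(rows, width=12)
print(table)
```

This only works when every row has the same number of cells. A table with empty cells would need the columns to be matched by horizontal position instead of by order.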

Usage Sample

Input

Output

                          HR Information                                 Contact
                                Position                                  Salary                                  Office                                   Extn.
                              Accountant                                $162,700                                   Tokyo                                    5407
           Chief Executive Officer (CEO)                              $1,200,000                                  London                                    5797
                 Junior Technical Author                                 $86,000                           San Francisco                                    1562
                       Software Engineer                                $132,000                                  London                                    2558
                       Software Engineer                                $206,850                           San Francisco                                    1314
                  Integration Specialist                                $372,000                                New York                                    4804
                       Software Engineer                                $163,500                                  London                                    6222
                       Pre-Sales Support                                $106,450                                New York                                    8330
                         Sales Assistant                                $145,600                                New York                                    3990
             Senior Javascript Developer                                $433,060                               Edinburgh                                    6224

OCR, Machine learning, Image processing, NLP


Mathias Gatti

Software Developer specialized in Data Science

I am a software developer specialized in data science. I have a computer science degree and several years of experience as a programmer and math teacher. In my spare time I contribute to open source projects.


13 Replies

prashant shitole

4 years ago

Hi, I am getting table data from an image using Python, but without formatting. What should I do to get properly formatted data? If anyone knows a source or reference, please suggest it.

Mathias Gatti

4 years ago

Maybe this: https://aws.amazon.com/es/textract/

Purushotham Rockss

5 years ago

Hi, I would like to get table data from an image using Java. Can you let me know of any source or way to do that? If anyone knows references, please suggest them.

Mathias Gatti

5 years ago

Maybe this: https://github.com/tabulapdf/tabula-java

Agro Vibe

6 years ago

Hi Mathias Gatti, you have done awesome work. I would like to know how I can fetch data based on the header, so I can skip the other extracted data. I am working with bank statements and facing format variation.

Mathias Gatti

6 years ago

Hi! You could try identifying header tokens based on their size, since header tokens are usually taller. You might also use specific words that usually appear in those headers to identify them. After that, you can check which tokens are aligned with the header columns to extract the whole table.
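A minimal sketch of that suggestion, under stated assumptions: tokens carry a bounding box, a header is any token noticeably taller than the median or matching a known keyword, and each remaining token goes to the header whose horizontal center is nearest. The Token class, the function names and the ratio threshold are hypothetical, not part of the project's code.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, FrozenSet, List


@dataclass
class Token:
    """Recognized token with its bounding box (hypothetical layout)."""
    text: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def x_center(self) -> float:
        return (self.x_min + self.x_max) / 2


def find_headers(tokens: List[Token],
                 keywords: FrozenSet[str] = frozenset(),
                 ratio: float = 1.2) -> List[Token]:
    """Pick tokens that are taller than typical or match a header keyword."""
    typical = median(t.height for t in tokens)
    return [t for t in tokens
            if t.height >= ratio * typical or t.text.lower() in keywords]


def assign_columns(tokens: List[Token],
                   headers: List[Token]) -> Dict[str, List[str]]:
    """Attach each non-header token to the horizontally nearest header."""
    columns: Dict[str, List[str]] = {h.text: [] for h in headers}
    # Walk the body from top to bottom so each column keeps its row order.
    for tok in sorted(tokens, key=lambda t: t.y_min):
        if tok in headers:
            continue
        nearest = min(headers, key=lambda h: abs(h.x_center - tok.x_center))
        columns[nearest.text].append(tok.text)
    return columns


tokens = [
    Token("Date", 10, 50, 0, 14), Token("Amount", 100, 160, 0, 14),
    Token("01/02", 10, 48, 20, 30), Token("$50", 120, 150, 20, 30),
    Token("03/02", 10, 48, 35, 45), Token("$75", 120, 150, 35, 45),
]
headers = find_headers(tokens)
columns = assign_columns(tokens, headers)
print(columns)
```

For bank statements with varying formats, passing a keyword set such as frozenset({"date", "amount"}) may be more reliable than height alone, since headers are not always printed larger than the body.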
