Amazon Textract

Omkar Khapare

24 Aug 2020

Sarita Gawade, Python Developer at Webcubator Technologies, has shared some remarkable insight into one of the machine learning applications that helps you extract text data from plain images. Exciting, isn’t it? Keep reading...

What is Textract?

Textract is an Amazon machine learning service that extracts text and data from documents. It goes beyond simple optical character recognition (OCR) by also identifying structure such as forms and tables.

With Amazon Textract, you can detect and analyze the text in single-page or multi-page input documents.

Input documents can be images (JPEG, PNG) for single-page documents and PDF files for multi-page documents.

A document is made up of the following types of Block objects.

1. Pages

2. Lines and words of text

3. Form data (Key-value pairs)

4. Tables and Cells

5. Selection elements
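
As a rough illustration of how these block types appear in practice, the sketch below tallies the BlockType field across a list of blocks. The helper name count_block_types and the sample data are made up for illustration; a real response carries many more fields per block.

```python
from collections import Counter


def count_block_types(blocks):
    """Tally how many blocks of each BlockType a list of blocks contains."""
    return Counter(block["BlockType"] for block in blocks)


# Hand-written sample standing in for response["Blocks"] (not real Textract output).
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE"},
    {"BlockType": "WORD"},
    {"BlockType": "WORD"},
]

counts = count_block_types(sample_blocks)
print(dict(counts))
```

A quick tally like this is a handy first check that a document was analyzed the way you expected, for example that table or form blocks are present at all.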

Why Textract?

There are tons of data present in this world. The data can be structured (tables, forms, databases, JSON) or unstructured (text, images, video).

If the data your business needs already exists in some form, you still have to access it and reshape it to fit your business requirements.

Textract lets you extract data from images and PDF files. Automating data extraction with Textract can save you from manual data entry.

How does it work?

Amazon Textract provides synchronous operations for processing small, single-page documents with near real-time responses.

It also provides asynchronous operations for processing larger, multi-page documents. Asynchronous responses are not returned in real time.
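
To make the asynchronous flow more concrete, here is a minimal sketch of polling a text-detection job until it finishes and then collecting its paginated results. It assumes a client that behaves like boto3.client('textract'), whose get_document_text_detection call returns JobStatus, Blocks and an optional NextToken. The helper name wait_for_text_detection, the polling interval and the retry limit are illustrative choices, not part of the service.

```python
import time


def wait_for_text_detection(client, job_id, poll_seconds=5, max_polls=60):
    """Poll an asynchronous text-detection job and return all of its blocks.

    `client` is assumed to behave like boto3.client("textract").
    """
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} did not finish in time")

    # This sketch treats anything other than SUCCEEDED as an error.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Job {job_id} ended with status {result['JobStatus']}")

    # Large results are paginated; follow NextToken until it is no longer returned.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


# Possible usage with a real client (placeholder bucket name and object key):
#   import boto3
#   textract = boto3.client("textract")
#   job = textract.start_document_text_detection(
#       DocumentLocation={"S3Object": {"Bucket": "BucketName", "Name": "documentPath"}}
#   )
#   blocks = wait_for_text_detection(textract, job["JobId"])
```

Polling keeps the example short. Passing the client in as a parameter also makes the helper easy to exercise with a stub, without calling AWS.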

When you provide a document as input, Textract internally puts it through two processes:

a. Detecting Text:

This process involves synchronous and asynchronous operations that return only the text identified in a document.

b. Analyzing:

In this process, Textract analyzes the document and identifies relationships between the detected text. After analyzing, Amazon Textract returns the following in multiple Block objects:

1. Lines and words of recognized text

2. Content of detected items

3. Relationship between detected items

4. The location of the item on the document page
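
The relationships between blocks are what let you rebuild form data. The sketch below assumes the block layout returned when a document is analyzed for forms (for example analyze_document with FeatureTypes=['FORMS']): KEY_VALUE_SET blocks whose EntityTypes contain KEY link to their value block through a VALUE relationship, and both link to their WORD blocks through CHILD relationships. The helper names and the sample blocks are hypothetical.

```python
def _child_text(block, blocks_by_id):
    """Join the text of all WORD blocks that `block` links to via CHILD relationships."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks):
    """Return a {key text: value text} dict built from KEY_VALUE_SET blocks."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _child_text(blocks_by_id[value_id], blocks_by_id)
                    for value_id in rel["Ids"]
                )
        pairs[_child_text(block, blocks_by_id)] = value_text
    return pairs


# Hand-written sample blocks (not real Textract output).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

form_data = extract_key_values(sample_blocks)
print(form_data)
```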

After getting the extracted data, users can apply business logic to get data in the required format.
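
One simple piece of such business logic, sketched here under assumptions, is to keep only the lines Textract is reasonably sure about. The helper name confident_lines, the threshold of 90 and the second sample block are made up for illustration.

```python
def confident_lines(blocks, min_confidence=90.0):
    """Return the text of LINE blocks whose confidence meets the threshold."""
    return [
        block["Text"]
        for block in blocks
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Sample blocks; the second entry is invented to show a rejected line.
sample_blocks = [
    {"BlockType": "LINE", "Confidence": 99.46614074707031, "Text": "NOODLES"},
    {"BlockType": "LINE", "Confidence": 41.2, "Text": "N00DLE5"},
    {"BlockType": "WORD", "Confidence": 99.5, "Text": "NOODLES"},
]

lines = confident_lines(sample_blocks)
print(lines)
```

The right threshold depends on your documents; lines that fall below it could be routed to manual review instead of being dropped.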

Example of extracting text from an image in an S3 bucket using Textract in Python:

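import boto3

# Create a Textract client (uses the AWS credentials and region configured in your environment).
textract = boto3.client('textract')

# Detect text in an image stored in S3; replace the placeholder bucket name and object key.
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': 'BucketName',
            'Name': 'imagePath'
        }
    }
)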

Example of Textract output (a single LINE block from the response):

{
    'BlockType': 'LINE',
    'Confidence': 99.46614074707031,
    'Text': 'NOODLES',
    'Geometry': {
        'BoundingBox': {
            'Width': 0.09946346282958984,
            'Height': 0.02102060243487358,
            'Left': 0.20258571207523346,
            'Top': 0.07283180207014084
        },
        'Polygon': [
            {'X': 0.20306137204170227, 'Y': 0.07283180207014084},
            {'X': 0.3020491600036621, 'Y': 0.07432965189218521},
            {'X': 0.3015735149383545, 'Y': 0.09385240823030472},
            {'X': 0.20258571207523346, 'Y': 0.09235455840826035}
        ]
    },
    'Id': '94039146-eb32-4126-b47f-513e6128814a',
    'Relationships': [
        {
            'Type': 'CHILD',
            'Ids': [
                '777aaa6c-fd1b-45b9-b7b6-28c6de804d42'
            ]
        }
    ]
}
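
The BoundingBox values in this output are ratios of the page width and height, not pixels. A minimal sketch of converting them to pixel coordinates follows; the helper name bounding_box_to_pixels and the 1000 x 2000 image size are assumptions for illustration, since the dimensions of the original image are not given.

```python
def bounding_box_to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based BoundingBox into (left, top, width, height) in pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    width = round(bounding_box["Width"] * image_width)
    height = round(bounding_box["Height"] * image_height)
    return left, top, width, height


# BoundingBox copied from the example output; the image size is assumed.
noodles_box = {
    "Width": 0.09946346282958984,
    "Height": 0.02102060243487358,
    "Left": 0.20258571207523346,
    "Top": 0.07283180207014084,
}

pixel_box = bounding_box_to_pixels(noodles_box, image_width=1000, image_height=2000)
print(pixel_box)  # (203, 146, 99, 42)
```

Pixel coordinates like these can be used to crop the detected line out of the source image or to draw a highlight around it.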

Features of Textract:

1. Uses optical character recognition (OCR) technology to detect text

2. Extracts structured and unstructured data

3. Processes single-page and multi-page documents

4. Security & Compliance

With so many machine learning applications in the world, Textract seems to be one of the most useful. If your business hasn’t leveraged the benefits of machine learning yet, you’re missing out on a lot of fun and revenue!