October 12, 2023

A couple of days ago I decided to get my act together regarding incoming physical mail and other documents, so I scanned everything before throwing it into a big box labelled “Archive” and… started to think about making the scanned documents searchable.

Searching through documents is a topic for another post; here we’ll focus on getting text out of images by means of the Optical Character Recognition (OCR) service from AWS – AWS Textract.

Preparation

The commands in this post have AWS Textract read the document from an S3 bucket, so we need a bucket before we can start extracting text. Let’s create one:

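# Create the bucket (for us-east-1, omit --create-bucket-configuration)
aws s3api create-bucket --bucket YOUR_BUCKET_NAME --create-bucket-configuration LocationConstraint=eu-west-1

# Expire every object one day after upload; the empty prefix matches all keys
aws s3api put-bucket-lifecycle-configuration --bucket YOUR_BUCKET_NAME --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "ExpireAfter1Day",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "Expiration": {
        "Days": 1
      }
    }
  ]
}'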

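Here we create a bucket and then set a lifecycle policy that automatically deletes all objects one day after upload – this way processed documents are cleaned up without any manual work. S3 runs expiration asynchronously, so the actual removal can lag a little behind the one-day mark.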
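If you set up such scratch buckets more than once, the two commands can be wrapped in a small shell function. The sketch below is only an illustration: make_scratch_bucket is a made-up helper name, the retention period is a parameter I added, and the us-east-1 branch is there because that region does not accept an explicit LocationConstraint.

```shell
# Sketch: create a short-lived "scratch" bucket for Textract input.
# make_scratch_bucket is a hypothetical helper, not part of the AWS CLI.
make_scratch_bucket() (
  bucket=$1
  region=$2
  days=${3:-1}   # retention in days; defaults to 1

  if [ "$region" = "us-east-1" ]; then
    # us-east-1 rejects an explicit LocationConstraint
    aws s3api create-bucket --bucket "$bucket" --region "$region" || exit 1
  else
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --create-bucket-configuration "LocationConstraint=$region" || exit 1
  fi

  # Expire every object after $days days; the empty prefix matches all keys
  aws s3api put-bucket-lifecycle-configuration --bucket "$bucket" \
    --lifecycle-configuration "{
      \"Rules\": [{
        \"ID\": \"ExpireAfter${days}Day\",
        \"Status\": \"Enabled\",
        \"Filter\": {\"Prefix\": \"\"},
        \"Expiration\": {\"Days\": $days}
      }]
    }"
)
```

Calling make_scratch_bucket YOUR_BUCKET_NAME eu-west-1 should give the same setup as the two commands above; aws s3api get-bucket-lifecycle-configuration --bucket YOUR_BUCKET_NAME shows the rule that was actually stored.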

Processing documents

Now it’s time to process a document and get its text. First we upload the file from the local disk to S3, then we call the AWS Textract detect-document-text action to get the results. The result is a fairly long JSON document, but we need just the text, so we post-process it using jq:

file=test.png
bucket=YOUR_BUCKET_NAME
# Upload the scan to the bucket
aws s3 cp "$file" "s3://$bucket/"
# Run OCR and keep only the text of each detected line
aws textract detect-document-text --document "S3Object={Bucket=$bucket,Name=$file}" | jq -r '.Blocks[] | select(.BlockType == "LINE") | .Text'

The result will be the text extracted from the image file. You can save it to a file, a database, or anywhere else you find useful for further processing 🙂
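For a whole box of scans, saving the text next to each image is probably the simplest option. Here is a rough sketch of that idea; ocr_dir is a made-up helper name, it only looks at *.png files, and instead of jq it uses the CLI's own --query and --output text options, so it is a variation on the command above rather than the same thing.

```shell
# Sketch: OCR every PNG in a directory and store the text next to each scan.
# ocr_dir is a hypothetical helper name; the bucket must already exist.
ocr_dir() (
  dir=$1
  bucket=$2

  for file in "$dir"/*.png; do
    [ -e "$file" ] || continue          # no PNGs in the directory: nothing to do
    name=$(basename "$file")

    # Upload the scan under its bare file name
    aws s3 cp "$file" "s3://$bucket/$name" >/dev/null || exit 1

    # [Text] puts each line in its own row, so text output has one line per row
    aws textract detect-document-text \
      --document "S3Object={Bucket=$bucket,Name=$name}" \
      --query "Blocks[?BlockType=='LINE'].[Text]" \
      --output text > "${file%.png}.txt" || exit 1
  done
)
```

Running ocr_dir ./scans YOUR_BUCKET_NAME would leave a scan.txt beside every scan.png. File names containing commas or spaces may not survive the shorthand S3Object={...} syntax, so treat this as a starting point.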

The nice thing about AWS Textract is that you don’t have to care much about the format of the file, as it recognizes text in images (JPEG, PNG, TIFF) as well as in PDF files 🙂
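One caveat, as far as I know: detect-document-text is a synchronous call meant for single-page documents, and multi-page PDFs go through the asynchronous pair start-document-text-detection and get-document-text-detection. Below is a rough sketch of that flow for a PDF already uploaded to S3. The helper name textract_async_lines and the polling delay are my own assumptions, and I'm relying on the CLI printing None for a missing NextToken in text output.

```shell
# Sketch: OCR a multi-page PDF that is already in S3, using the asynchronous API.
# textract_async_lines is a hypothetical helper name.
textract_async_lines() (
  bucket=$1
  name=$2
  delay=${3:-5}    # seconds between status checks (arbitrary default)

  job_id=$(aws textract start-document-text-detection \
    --document-location "S3Object={Bucket=$bucket,Name=$name}" \
    --query JobId --output text) || exit 1

  # Poll until the job leaves the IN_PROGRESS state
  while :; do
    status=$(aws textract get-document-text-detection --job-id "$job_id" \
      --query JobStatus --output text) || exit 1
    [ "$status" = "IN_PROGRESS" ] || break
    sleep "$delay"
  done

  if [ "$status" != "SUCCEEDED" ]; then
    echo "Textract job $job_id ended with status $status" >&2
    exit 1
  fi

  # Results are paginated: print the lines of each page, then follow NextToken
  token=""
  while :; do
    set -- textract get-document-text-detection --job-id "$job_id"
    [ -z "$token" ] || set -- "$@" --next-token "$token"
    aws "$@" --query "Blocks[?BlockType=='LINE'].[Text]" --output text || exit 1
    token=$(aws "$@" --query NextToken --output text) || exit 1
    # Assumption: text output shows a missing NextToken as "None"
    [ -n "$token" ] && [ "$token" != "None" ] || break
  done
)
```

For example, textract_async_lines YOUR_BUCKET_NAME scan.pdf > scan.txt would collect the text of all pages in one file. Fetching each page twice (once for the text, once for the token) keeps the sketch free of jq at the cost of extra API calls.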

About the author - Łukasz Tomaszkiewicz

My name is Łukasz Tomaszkiewicz and I help developers and DevOps engineers accelerate their careers in the public cloud.

For many years I've supported developer teams in delivering cloud-based solutions, from the design phase, through deployment automation, to monitoring and optimization as well as day-to-day operations.

In my professional life I'm mainly focused on AWS infrastructure; after hours, however, I like doing my own side projects, as it helps me understand the issues developers may have while using the cloud.
