How to operate OCR engines - II

This blog explores advanced Optical Character Recognition (OCR) applications using the Tesseract engine, reviews Tesseract's Page Segmentation Modes (PSMs), and provides guidance on when to use each.

In our previous blog, we covered the basics of OCR and two popular open-source tools, Tesseract and EasyOCR, with hands-on tutorials on how to use them effectively. In this blog, we will talk about some advanced use cases that you may encounter with OCR.

CASE 1:

Take a different, italic-style input (Figure 1). This one is a little more challenging, and unfortunately none of the Tesseract options recognizes it completely. A few PSM options do reach 80-90% accuracy, as shown in Figure 2, but the letter “i” is read as “t” and “t” is read as “k”. For now, Tesseract has no option that handles this scenario, so you can either re-train the Tesseract model for this kind of input or use a commercial OCR engine.

Figure 1: Italic style

$ tesseract italic.png stdout --psm 8

Figure 2: Figure 1 output
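To compare several PSMs on the same image without retyping the command each time, the call can be wrapped in a small script. The following is only a minimal sketch, assuming the tesseract binary is installed and on your PATH. The helper names build_tesseract_command and run_psm_sweep are hypothetical and not part of Tesseract.

```python
import subprocess

# Tesseract defines fourteen page segmentation modes, numbered 0 to 13.
VALID_PSMS = range(14)


def build_tesseract_command(image_path, psm):
    """Return the argument list for: tesseract <image> stdout --psm <psm>."""
    if psm not in VALID_PSMS:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_psm_sweep(image_path, psms):
    """Run Tesseract once per PSM and map each PSM to the recognized text.

    Assumes the tesseract binary is installed and on PATH.
    A failed run is recorded as None instead of aborting the sweep.
    """
    results = {}
    for psm in psms:
        completed = subprocess.run(
            build_tesseract_command(image_path, psm),
            capture_output=True,
            text=True,
        )
        if completed.returncode == 0:
            results[psm] = completed.stdout.strip()
        else:
            results[psm] = None
    return results
```

For example, run_psm_sweep("italic.png", [6, 7, 8]) returns a dictionary that you can inspect to see which mode comes closest to the expected text.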


CASE 2: Consider Figure 3, which is a receipt from the grocery store. Let’s try to OCR this image using the default (--psm 3) mode:

Figure 3: Whole Foods Market receipt we will OCR.

$ tesseract receipt.png stdout --psm 3

Figure 4: Figure 3 OCR using PSM 3

$ tesseract receipt.png stdout --psm 4

Figure 5: Figure 3 OCR using PSM 4

The default mode did not go so well. With --psm 3, Tesseract cannot infer that we are looking at column data and that text within the same row should be associated together (Figure 4).
To address this issue, we can use the --psm 4 mode. As Figure 5 shows, the results are far superior: Tesseract understands that text should be grouped row by row, enabling us to OCR the receipt's items.
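Once the text comes out row by row, each row can be split into an item and a price for further processing. The following is a minimal sketch rather than a definitive parser. It assumes every item row ends in a price with two decimal places, which is a hypothetical layout and may not match every receipt. The name parse_receipt_rows is illustrative only.

```python
import re
from decimal import Decimal

# Assumed row layout: item description, whitespace, optional "$", price with two decimals.
ROW_PATTERN = re.compile(r"^(?P<item>.+?)\s+\$?(?P<price>\d+\.\d{2})\s*$")


def parse_receipt_rows(ocr_text):
    """Extract (item, price) pairs from row-oriented OCR output.

    Lines that do not end in a price (store name, address, blank lines)
    are skipped.
    """
    rows = []
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append((match.group("item"), Decimal(match.group("price"))))
    return rows
```

The input is whatever text Tesseract printed with --psm 4. Rows that were recognized without a trailing price are silently dropped, so it is worth comparing the number of parsed rows against the receipt.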

Note on Case 4 below (Figures 9 and 10): PSM 12 is essentially identical to PSM 11, but it also includes orientation and script detection (OSD).

CASE 3: Now we will try an interesting and challenging input: a license plate of the kind processed by an automatic license/number plate recognition (ANPR) system (Figure 6).

Unfortunately, PSM 3 doesn't work for this input, whereas PSM 7, which treats the image as a single text line, gives the correct result. PSM 8 gives the same result. The difference is that PSM 7 assumes a single line and PSM 8 assumes a single word, so you can select either of them based on your input type.

Figure 6: A license plate we will OCR

$ tesseract numberplate1.png stdout --psm 3

$ tesseract numberplate1.png stdout --psm 7

Figure 7: Result on Figure 6.
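In an ANPR pipeline the recognized string is usually cleaned up before it is stored or compared. The following is a small sketch of one possible post-processing step, assuming plates contain only Latin letters and digits. That assumption and the name normalize_plate are mine, not something Tesseract provides.

```python
import re


def normalize_plate(raw_text):
    """Uppercase the OCR output and keep only the characters A-Z and 0-9."""
    # Assumption: spaces, hyphens, and newlines in the OCR output are not
    # part of the plate number and can be discarded.
    return re.sub(r"[^A-Z0-9]", "", raw_text.upper())
```

This step only removes separators and stray whitespace. It does not correct misread characters.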

CASE 4: Text presented in rows and columns, i.e. sparse text, as depicted in Figure 8. For this kind of input we can again start with the default PSM 3, but PSM 11 is best suited because it is specially designed for sparse text recognition. For the experiment, refer to Figures 9 and 10.

Figure 8: Sparse text

$ tesseract sadhgurubook_chapter.png stdout --psm 3

Figure 9: Figure 8 OCR using PSM 3

$ tesseract sadhgurubook_chapter.png stdout --psm 11

Figure 10: Figure 8 OCR using PSM 11 

Now let's try some bigger hurdles.

“CASE 5: Handwritten text” and “CASE 10: Image in table form” are shown in Figures 11 and 13 respectively. For Case 5, our experiments show that Tesseract's PSM 9 option works well on this sample, but slightly harder handwriting is not recognized even with PSM 9. This is why full handwritten OCR is still a research topic.

Figure 11: Handwritten text image

$ tesseract handwriten.png stdout --psm 3

$ tesseract handwriten.png stdout --psm 9

Figure 12: Result of Figure 11

Moving on to the table image in Figure 13, a table of the top 10 highest-scoring teams in ODI cricket. When the input is a table, we expect the output to be in table format as well, but unfortunately neither PSM 3 nor PSM 11 produces that result. The output is depicted in Figure 14. Handling inputs like those of CASE 9 and 10 requires some image pre-processing, which I will cover in an additional blog post in the near future.

Figure 13: Top 10 cricket highest score teams in ODI in table image format

$ tesseract tabel.png stdout --psm 3

$ tesseract tabel.png stdout --psm 11

Figure 14: Result of Figure 13

Summary 

Tesseract offers many PSM options. Each of its fourteen PSMs assumes something about your source image: a block of content (for example, a scanned book), a single line of text (for example, a single sentence from an article), or a single word (for example, a license plate). The skill lies in selecting the right option for the desired output, and the cases above illustrate how to make that choice.

OCR is widely used in traffic monitoring and video surveillance applications for number plate recognition. If you want an open-source engine such as Tesseract or EasyOCR for this, Tesseract with PSM 7 or 8 is currently the best choice. For digitizing billing receipts, where the invoice is needed in Excel or Word format for further accounting, Tesseract with PSM 4 is a good option. A few non-ASCII characters present in an invoice may be missing from the output. You can ignore them by applying a filter in your script.

Before applying any PSM option, run tesseract --help-psm to list the modes, start with the default PSM 3, and then try the others according to their descriptions. The more experience you gain with PSMs, the easier it will be to apply OCR to your own tasks.
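The non-ASCII filter mentioned in the summary can be sketched in a few lines. This is one possible approach, not the only one, and the function name strip_non_ascii is hypothetical.

```python
def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    # Encoding with errors="ignore" silently discards characters that
    # cannot be represented in ASCII, then decoding restores a str.
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Note that this also removes legitimate symbols such as currency signs and accented letters, so apply it only where those characters are not needed downstream.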

Hi, my name is Kiran Kamble. When I am done analyzing data, I play badminton and cricket and weekends are meant for hiking.
