How to operate OCR engines - II

This blog explores advanced Optical Character Recognition (OCR) applications using the Tesseract engine, reviews Tesseract's Page Segmentation Modes (PSMs), and provides guidance on their usage.

In our previous blog, we covered the basics of OCR and two popular open-source tools, Tesseract and EasyOCR, with hands-on tutorials on how to use them effectively. In this blog, we will talk about some advanced use cases that we may encounter with OCR.

CASE 1:

Consider input written in a different italic style. This is a little more challenging: none of the Tesseract options recognizes it fully, although a few PSM options give results that are 80-90% accurate, as depicted in Figure 2. For example, “i” is recognized as “t” and “t” as “k”. For now, Tesseract has no option that handles this scenario, so you can either re-train the Tesseract model for this kind of input or use a commercial OCR engine.

Figure 1: Italic style

$ tesseract italic.png stdout --psm 8

Figure 2: Figure 1 output

CASE 2: Consider Figure 3, which is a receipt from the grocery store. Let’s try to OCR this image using the default (--psm 3) mode:

Figure 3: Whole Foods Market receipt we will OCR.

$ tesseract receipt.png stdout --psm 3

Figure 4: Result on Figure 3 with PSM 3

$ tesseract receipt.png stdout --psm 4

Figure 5: Result on Figure 3 with PSM 4

The default mode did not go so well (Figure 4). With --psm 3, Tesseract cannot infer that we are looking at column data and that text within the same row must be associated together.
To address this issue, we can use the --psm 4 mode. As Figure 5 shows, the results are far superior: Tesseract understands that text should be grouped row by row, enabling us to OCR the receipt's items.
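The comparison between modes can also be scripted. The following is a minimal sketch, assuming Python 3 and a tesseract binary on the PATH; the helper names (build_tesseract_command, run_ocr, compare_psms) are hypothetical and introduced only for illustration, and the mode descriptions are paraphrased from the output of tesseract --help-psm.

```python
import subprocess

# Page segmentation modes, paraphrased from `tesseract --help-psm`.
PSM_DESCRIPTIONS = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    5: "Assume a single uniform block of vertically aligned text",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    8: "Treat the image as a single word",
    9: "Treat the image as a single word in a circle",
    10: "Treat the image as a single character",
    11: "Sparse text: find as much text as possible in no particular order",
    12: "Sparse text with OSD",
    13: "Raw line: a single text line, bypassing Tesseract-specific hacks",
}


def build_tesseract_command(image_path, psm):
    """Return the argument list for `tesseract <image> stdout --psm <psm>`."""
    if psm not in PSM_DESCRIPTIONS:
        raise ValueError(f"unknown PSM: {psm}")
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_ocr(image_path, psm):
    """Run Tesseract (assumed to be installed) and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def compare_psms(image_path, psms=(3, 4)):
    """Print the OCR output of the same image under each PSM in turn."""
    for psm in psms:
        print(f"--- PSM {psm}: {PSM_DESCRIPTIONS[psm]} ---")
        print(run_ocr(image_path, psm))
```

Calling compare_psms("receipt.png") would print the PSM 3 and PSM 4 outputs one after the other, which makes the row grouping difference easy to spot.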

Note: PSM 12 is essentially identical to PSM 11, the sparse text mode used in CASE 4 (Figures 9 and 10), but it additionally performs orientation and script detection (OSD).

CASE 3: Now we will try an interesting and challenging input from an automatic license/number plate recognition (ANPR) system, shown in Figure 6. Unfortunately, PSM 3 does not work for this input, whereas PSM 7, which treats the image as a single text line, gives the correct result. PSM 8 gives the same result. The difference between PSM 7 and PSM 8 is that the former assumes a single line and the latter a single word, so you can select either of them based on your input type.

Figure 6: A license plate we will OCR

$ tesseract numberplate1.png stdout --psm 3

$ tesseract numberplate1.png stdout --psm 7

Figure 7: Result on Figure 6.
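When PSM 7 and PSM 8 are both tried on a plate, their raw outputs can differ only in whitespace or trailing control characters. The sketch below is one way to compare them, assuming a plate consists solely of ASCII letters and digits (an assumption that will not hold for every plate format). The function names and the sample strings are hypothetical, not the actual output for Figure 6.

```python
def normalize_plate(text):
    """Keep only ASCII letters and digits from the OCR output, uppercased."""
    return "".join(ch for ch in text if ch.isascii() and ch.isalnum()).upper()


def psm_outputs_agree(line_mode_text, word_mode_text):
    """Return True when two OCR outputs match after normalization."""
    return normalize_plate(line_mode_text) == normalize_plate(word_mode_text)


# Hypothetical raw outputs from --psm 7 and --psm 8 for the same image.
line_mode = "mh 12 de 1433\n\x0c"
word_mode = "MH12DE1433\n"
print(normalize_plate(line_mode))
print(psm_outputs_agree(line_mode, word_mode))
```

If the two modes disagree after normalization, that is a hint to inspect the image or try pre-processing rather than trust either result.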

CASE 4: Sparse text, i.e. text presented in the form of rows and columns, as depicted in Figure 8. For this kind of input we can again start with the default PSM 3, but PSM 11 is better suited because it is specially designed for sparse text recognition. For the experiment, refer to Figures 9 and 10.

Figure 8: Sparse text

$ tesseract sadhgurubook_chapter.png stdout --psm 3

Figure 9: Figure 8 OCR using PSM 3

$ tesseract sadhgurubook_chapter.png stdout --psm 11

Figure 10: Figure 8 OCR using PSM 11 

Now let's try some bigger hurdles: “CASE 5: Handwritten text” and “CASE 6: Image in table form”, shown in Figures 11 and 13 respectively. For CASE 5, our experimentation shows that Tesseract's PSM 9 option works well on this sample; however, handwriting that is a little harder to read does not work even with PSM 9. That is why full handwritten OCR is still a research topic.

Figure 11: Handwritten text image

$ tesseract handwriten.png stdout --psm 3

$ tesseract handwriten.png stdout --psm 9

Figure 12: Result of Figure 11

Moving on to the table image in Figure 13, which presents the top 10 highest-scoring cricket teams in ODI as a table. When the input is a table, we expect the output to be in table format as well, but unfortunately neither PSM 3 nor PSM 11 produces that; the output is depicted in Figure 14. Handling handwritten and table inputs like these requires some image pre-processing. To address this, I will be writing an additional blog post in the near future.

Figure 13: Top 10 cricket highest score teams in ODI in table image format

$ tesseract tabel.png stdout --psm 3

$ tesseract tabel.png stdout --psm 11

Figure 14: Result of Figure 13

Summary 

Tesseract offers many PSM options. Each of Tesseract's fourteen PSMs makes an assumption about your source image: a block of content (e.g. a scanned book), a single line of text (e.g. a single sentence from an article), or a single word (e.g. a license plate). Our skill lies in selecting the correct option for the desired output, and here I have presented various cases to guide that choice.

OCR is used most often in traffic monitoring and video surveillance applications for number plate recognition. If we want an open-source engine such as Tesseract or EasyOCR for this, Tesseract with PSM 7 or 8 is currently the preferred choice. In the billing receipt digitization process, if we need an invoice in Excel or Word format for further accounting, we can use Tesseract with PSM 4. However, a few non-ASCII characters present in an invoice may not be recognized correctly; you can drop them by applying a filter in your script.

Likewise, before applying any PSM option, run tesseract --help-psm to list the modes, start with the default PSM 3, and then try the rest according to their descriptions. The more experience you gain with PSMs, the easier it will be to apply OCR to your own tasks.
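The non-ASCII filter mentioned for invoices can be very small. Here is a minimal sketch, assuming the OCR output is already available as a Python string; the function name strip_non_ascii and the sample line are hypothetical.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range, keeping the rest in order."""
    return "".join(ch for ch in text if ch.isascii())


# Hypothetical OCR line containing a currency symbol and an accented letter.
print(strip_non_ascii("Total: ₹120.50 café"))
```

Note that this simply discards the characters. If currency symbols or accented names matter for your accounting data, mapping them to replacements would be a safer choice than dropping them.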

Hi, my name is Kiran Kamble. When I am done analyzing data, I play badminton and cricket and weekends are meant for hiking.
