This blog explores advanced Optical Character Recognition (OCR) applications using the Tesseract engine, reviews Tesseract's Page Segmentation Modes (PSMs), and provides guidance on when to use each of them.
In our previous blog, we covered the basics of OCR and two popular open-source tools, Tesseract and EasyOCR, with hands-on tutorials on how to use them effectively. In this blog, we will talk about some advanced use cases that you may encounter with OCR.
CASE 1: Take an input in a different, italic style (Figure 1). This is a little more challenging, and unfortunately none of the Tesseract options recognizes it completely. For a few PSM options, however, the result is 80-90% accurate, as depicted in Figure 2. The typical errors are that “i” is read as “t” and “t” is read as “k”. For now there is no Tesseract option that handles this scenario, so you can either re-train the Tesseract model for this kind of input or use a commercial OCR engine.
Figure 1: Italic style
$ tesseract italic.png stdout --psm 8
Figure 2: Figure 1 output
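The 80-90% figure is quoted without a measurement method. One way to put a number on it, assuming you have typed the expected text by hand, is a character-level similarity ratio from Python's standard `difflib` module. This is only a sketch: the function name `char_accuracy` and the sample strings are illustrative and are not actual Tesseract output.

```python
from difflib import SequenceMatcher


def char_accuracy(expected: str, recognized: str) -> float:
    """Return a 0..1 similarity ratio between ground truth and OCR output.

    Whitespace is collapsed first so that line-break differences do not
    dominate the score.
    """
    a = " ".join(expected.split())
    b = " ".join(recognized.split())
    if not a and not b:
        return 1.0
    # autojunk=False keeps the comparison exact for longer texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    # Made-up strings mimicking the "i" -> "t" and "t" -> "k" confusions.
    truth = "italic text"
    ocr = "ttalic kexk"
    print(f"similarity: {char_accuracy(truth, ocr):.2f}")
```

A ratio like this is a rough similarity score rather than a formal character error rate, but it is enough to rank PSM options against each other on the same image.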
CASE 2: Consider Figure 3, which is a receipt from the grocery store. Let’s try to OCR this image using the default (--psm 3) mode:
Figure 3: Whole Foods Market receipt we will OCR.
$ tesseract receipt.png stdout --psm 3
Figure 4: Figure 3 OCR using PSM 3
$ tesseract receipt.png stdout --psm 4
Figure 5: Figure 3 OCR using PSM 4
That did not go so well with the default --psm 3 mode (Figure 4). Tesseract cannot infer that we are looking at column data and that text within the same row must be associated together.
To address this issue, we can use the --psm 4 mode. As you can see in Figure 5, the results are far superior: Tesseract understands that text should be grouped row-by-row, enabling us to OCR the receipt's items.
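Once PSM 4 keeps each item and its price on the same line, the text can be split into (description, price) pairs. The following is a minimal sketch, assuming every item row ends in a price such as `4.99`, optionally followed by a one- or two-character flag. The regular expression, the function name `parse_receipt_rows`, and the sample rows are our own assumptions, not output from the receipt in Figure 3.

```python
import re
from typing import List, Tuple

# Assumed row shape: free-text description, whitespace, then a price such as
# "4.99" or "$12.50", optionally followed by a short tax/flag code.
ROW_PATTERN = re.compile(
    r"^(?P<item>.+?)\s+\$?(?P<price>\d+\.\d{2})(?:\s+\w{1,2})?\s*$"
)


def parse_receipt_rows(ocr_text: str) -> List[Tuple[str, float]]:
    """Extract (description, price) pairs from row-wise OCR text.

    Lines that do not end in a price (store header, blank lines) are skipped.
    """
    rows = []
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append((match.group("item").strip(), float(match.group("price"))))
    return rows


if __name__ == "__main__":
    # Hypothetical rows, only to show the expected input shape.
    sample = "ORGANIC BANANAS   1.29 F\nSTORE #123\nALMOND MILK $3.99\n\nTOTAL 5.28"
    for item, price in parse_receipt_rows(sample):
        print(f"{item}: {price:.2f}")
```

With PSM 3 output, where descriptions and prices end up in separate blocks, a row-based parser like this would find little to match, which is one practical reason to prefer PSM 4 for receipts.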
Note: PSM 12 is essentially identical to PSM 11 (the sparse text mode used for Figures 9 and 10), but it additionally performs orientation and script detection (OSD).
CASE 3: Now we will try an interesting and challenging input from an automatic license/number plate recognition (ANPR) system, shown in Figure 6. Unfortunately, PSM 3 does not work for this input, whereas PSM 7, which treats the image as a single text line, gives the correct result. PSM 8 gives the same result. The difference between the two is that PSM 7 assumes a single line and PSM 8 assumes a single word, so you can select either of them based on your input type.
Figure 6: A license plate we will OCR
$ tesseract numberplate1.png stdout --psm 3
$ tesseract numberplate1.png stdout --psm 7
Figure 7: Result on Figure 6.
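In an ANPR pipeline the recognized string usually needs a little cleanup before it can be matched against a database. Below is a small sketch, assuming plates contain only the letters A-Z and the digits 0-9. The function name `normalize_plate` and the sample plate are made up for illustration.

```python
import re


def normalize_plate(ocr_text: str) -> str:
    """Return the first non-empty OCR line reduced to upper-case A-Z and 0-9.

    Spaces, hyphens and stray punctuation are removed. Tesseract's text
    output may also end with a form-feed page separator, which is dropped
    here because splitlines() treats it as a line boundary.
    """
    for line in ocr_text.splitlines():
        cleaned = re.sub(r"[^A-Z0-9]", "", line.upper())
        if cleaned:
            return cleaned
    return ""


if __name__ == "__main__":
    # Made-up plate text, not the plate shown in Figure 6.
    print(normalize_plate(" mh 12-ab 1234\n\x0c"))
```

This only tidies the output. It does not correct misread characters, so it complements the PSM 7 or PSM 8 choice rather than replacing it.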
CASE 4: Text scattered across rows and columns, i.e. sparse text, is depicted in Figure 8. For this kind of input we can again start with the default PSM 3, but PSM 11 is best suited because it is specifically designed for sparse text recognition. For the experiment, refer to Figures 9 and 10.
Figure 8: Sparse text
$ tesseract sadhgurubook_chapter.png stdout --psm 3
Figure 9: Figure 8 OCR using PSM 3
$ tesseract sadhgurubook_chapter.png stdout --psm 11
Figure 10: Figure 8 OCR using PSM 11
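Comparing modes by hand gets tedious once more than two are involved. The sketch below wraps the same command line used throughout this post so that several PSMs can be run in a loop. It assumes the `tesseract` executable is on your PATH. The helper names `build_command` and `compare_psms` are our own and not part of any Tesseract API.

```python
import subprocess
from typing import Dict, Iterable, List


def build_command(image_path: str, psm: int) -> List[str]:
    """Build the CLI call used in this post: tesseract IMAGE stdout --psm N."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def compare_psms(image_path: str, psms: Iterable[int] = (3, 11)) -> Dict[int, str]:
    """Run Tesseract once per mode and collect the recognized text per PSM.

    Requires the `tesseract` executable; raises CalledProcessError if a run
    exits with a non-zero status.
    """
    results = {}
    for psm in psms:
        completed = subprocess.run(
            build_command(image_path, psm),
            capture_output=True,
            text=True,
            check=True,
        )
        results[psm] = completed.stdout
    return results
```

Calling `compare_psms("sadhgurubook_chapter.png")` would return a dictionary keyed by 3 and 11, so the two outputs can be printed or diffed side by side.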
Now let's try some big hurdles:
“CASE 5: Handwritten text” and “CASE 10: Image in table form”, shown in Figures 11 and 13 respectively. For Case 5, our experimentation shows that Tesseract's PSM 9 option works well on this sample (Figure 12); however, handwriting that is a little harder to read does not work even with PSM 9. That is why full handwritten OCR is still a research topic.
Figure 11: Handwritten text image
$ tesseract handwriten.png stdout --psm 3
$ tesseract handwriten.png stdout --psm 9
Figure 12: Result of Figure 11
Moving on to the table image, Figure 13 presents the top 10 highest-scoring cricket teams in ODI as a table image. When the input is a table, we expect the output to be in table format as well, but unfortunately neither PSM 3 nor PSM 11 produces that. The output is depicted in Figure 14. To handle inputs like Cases 9 and 10, some image pre-processing is necessary. I will cover this in an additional blog post in the near future.
Figure 13: Top 10 highest-scoring cricket teams in ODI, in table image format
$ tesseract tabel.png stdout --psm 3
$ tesseract tabel.png stdout --psm 11
Figure 14: Result of Figure 13
Tesseract offers many PSM options. Each of Tesseract's fourteen PSMs assumes certain things about your source image: a block of content (for example, a scanned book), a single line of text (for example, a single sentence from an article), or perhaps a single word (for example, a license plate). The skill lies in selecting the right option for the desired output, and the cases presented here illustrate how to make that choice.
OCR is often used in traffic monitoring and video surveillance applications for number plate recognition. If you want an open-source engine such as Tesseract or EasyOCR for this, Tesseract with PSM 7 or 8 is currently the best choice. In billing receipt digitization, where the invoice is needed in Excel or Word format for further accounting, you can use Tesseract with PSM 4. A few non-ASCII characters in an invoice may not be recognized correctly; you can drop them by applying a filter in your script.
Before applying any PSM option, check the list printed by tesseract --help-psm, start with the default PSM 3, and then try the others according to their descriptions. The more experience you gain with PSMs, the easier it will be to apply OCR to your own tasks.
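The recommendations from the cases in this post, together with the non-ASCII filter mentioned for invoices, can be captured in a few lines of Python. Treat this as a starting point under our own assumptions: the dictionary keys, `suggest_psm`, and `strip_non_ascii` are illustrative names, and the mapping only reflects the experiments described here.

```python
# Starting points taken from the cases in this post; the keys are
# illustrative labels, not Tesseract terminology.
SUGGESTED_PSM = {
    "license_plate": 7,  # single text line (use 8 for a single word)
    "receipt": 4,        # single column of variable-size text
    "sparse_text": 11,   # scattered text in no particular order
    "handwriting": 9,    # worked only for easy handwriting in our test
}
DEFAULT_PSM = 3  # fully automatic page segmentation, Tesseract's default


def suggest_psm(input_kind: str) -> int:
    """Return the PSM this post suggests trying first for an input kind."""
    return SUGGESTED_PSM.get(input_kind, DEFAULT_PSM)


def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range, keeping newlines."""
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    print(suggest_psm("receipt"))
    print(strip_non_ascii("Café 2,50 €\nTotal"))
```

Dropping characters is lossy, so a filter like this suits cases where symbols such as currency signs are not needed downstream.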