This blog is part one of a comprehensive guide to Optical Character Recognition (OCR). We discuss popular open-source tools, Tesseract & EasyOCR, with hands-on tutorials on how to use the tools effectively.
Optical character recognition, or OCR, is not a new topic in the field of document understanding. OCR is a technique (both electronic and mechanical) for converting un-editable text in images into machine-encoded, editable text (i.e., a "string" data type). We usually associate OCR with software; in other words, methods that convert images of text into machine-readable strings.
A skilled practitioner's flow of OCR recognition
The task is to convert image text data to machine-readable text using OCR engines. However, ever since the 1960s, when image interpretation and computer vision were first developed, researchers have struggled to build generalized OCR systems that work across broad and varied use cases.
For example, if I had to show the following image to my OCR engine, I would expect it to detect the text, recognize the text, and then encode the text as editable string data.
Output => CODITATION
However, despite its simplicity, OCR is exceptionally hard. Although the discipline of computer vision has been around for more than 50 years (with mechanical OCR machines dating back over 100 years), we have yet to "solve" OCR and create an off-the-shelf OCR system that works in almost any situation.
There are too many factors to account for, such as noise, writing style, image quality, etc. We're still a long way from solving OCR. There are so many complexities in how humans share information through writing. As a result, it is fair to say that computer vision systems will never be able to read image text with 100% reliability.
This blog would not exist if OCR were already a solved problem. Your first Google search would have directed you to the code you needed to apply OCR convincingly and correctly to your task. However, that is not the world we live in. While we're getting better at tackling OCR challenges, knowing how to apply a present-day OCR engine still requires a skilled practitioner.
Tesseract, which was created by Hewlett-Packard in the 1980s, was made open-source in 2005. Google took over sponsorship of the project in 2006 and has supported it ever since. Tesseract supports a wide range of natural languages, from English (initially) to Punjabi to Yiddish. Since the updates in 2015, it supports over 100 written languages and has code in place so that it can easily be trained on other languages as well. Originally a C program, it was ported to C++ in 1998. The software is headless and can only be run from the command line. It does not include a graphical user interface (GUI), but various other software packages wrap Tesseract to offer one.
Tesseract is particularly well suited to document-processing pipelines in which images are scanned and pre-processed before Optical Character Recognition is applied.
EasyOCR, as the name implies, is a Python package that enables computer vision programmers to accomplish Optical Character Recognition with ease.
The EasyOCR package is created and maintained by Jaided AI, a company that specializes in Optical Character Recognition services. EasyOCR is implemented in Python on top of the PyTorch library. When you have a CUDA-capable GPU, the underlying PyTorch deep learning library can drastically improve text detection and OCR speed. EasyOCR can currently OCR text in 58 languages, including English, German, Hindi, Russian, and others, and the developers intend to add more languages in the coming years. EasyOCR currently only supports OCR of typed text; the team also intends to release a handwriting recognition system later in 2020!
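As a quick illustration of how little code EasyOCR needs, here is a minimal sketch. The helper name and file path are hypothetical; it assumes `pip install easyocr`, and the first run downloads the detection and recognition models.

```python
def read_image_text(image_path, langs=("en",), use_gpu=False):
    """Return the text strings EasyOCR finds in an image."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    # A Reader loads the models once and can be reused across many images.
    # gpu=True enables CUDA acceleration when a compatible GPU is present.
    reader = easyocr.Reader(list(langs), gpu=use_gpu)

    # readtext returns a list of (bounding_box, text, confidence) tuples;
    # here we keep only the recognized strings.
    return [text for _, text, _ in reader.readtext(image_path)]
```

Calling `read_image_text("aboutcoditation.png")` would return a list of detected strings; pass `use_gpu=True` to take advantage of a CUDA-capable GPU.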
You can improve OCR accuracy by preprocessing your images with computer vision and image processing libraries like OpenCV and scikit-image. However, the question is: which algorithms and techniques do you employ? Deep learning has delivered near state-of-the-art accuracy in almost every field of computer science, but for OCR, which deep learning models, layer types, and loss functions do you use?
Utilizing Tesseract options and configurations to improve OCR accuracy

We can also use machine learning to denoise our images and improve OCR accuracy. Tesseract performs several image processing operations internally (via the Leptonica library) before performing OCR. It usually does a fine job of this, but there will undoubtedly be cases where it falls short, resulting in a significant decrease in accuracy. Image pre-processing techniques such as rescaling, binarisation, noise removal, dilation or erosion, rotation or deskewing, border handling, and removing transparency or the alpha channel all improve the final OCR inferences. On complex images that yield no results, Tesseract tries to OCR the text but fails miserably, returning illogical results.

I was annoyed when I couldn't get the correct OCR result. I had no idea when and how to use the various options, and because the documentation was so thin and lacked actual examples, I didn't understand how half of the options behaved!
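To make the pre-processing list above concrete, here is a hedged sketch of a typical pass using OpenCV. The function and file names are illustrative, not a fixed recipe, and it assumes `pip install opencv-python`.

```python
def preprocess_for_ocr(image_path, out_path="preprocessed.png"):
    """Rescale, denoise, and binarise an image before handing it to Tesseract."""
    import cv2  # lazy import; requires `pip install opencv-python`

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Rescaling: Tesseract tends to do better when the text is larger.
    gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # Noise removal: a small median blur smooths salt-and-pepper noise.
    gray = cv2.medianBlur(gray, 3)
    # Binarisation: Otsu's method picks the black/white threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite(out_path, binary)
    return out_path
```

The cleaned file can then be passed to `tesseract preprocessed.png stdout` instead of the raw scan; steps like deskewing or dilation can be added to the same function when the input calls for them.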
The lesson I learned, and perhaps one of the most common mistakes I see newcomers to OCR making, is failing to fully understand how Tesseract's page segmentation modes can strongly impact the correctness of your OCR output.
When working with the Tesseract OCR engine, you must become acquainted with Tesseract's PSMs; without them, you will quickly become frustrated and will be unable to achieve high OCR accuracy.
Simply supply the --help-psm argument to tesseract to get a list of the 14 PSMs. Skilled practitioners can then experiment with the page segmentation option that best fits their input data. To see the details of the PSM options: $ tesseract --help-psm
Figure 1: PSM option detail descriptions
Let's play with the input type and the PSM options.
CASE 1: We just want to verify the orientation of the text present in the input image below.
Figure 2: Just need orientation of text
It is pretty simple using Tesseract's PSM option 0; the command is $ tesseract <image path> stdout --psm 0
Figure 3: Output of the PSM 0 option
You can see the orientation of the input is 0 degrees [it may be 90, 180, or 270 depending on the input]. The output also reports the detected script (i.e., writing system), such as Latin, Han, or Cyrillic, along with its confidence.
Figure 4: Just need the orientation of the text
$ tesseract aboutcoditation_rotated.png stdout --psm 0
Figure 5: Just need the orientation of the text
You can see in the output window of Figure 5 that the orientation of Figure 4 is 270 degrees; to correct the visibility, just rotate the image 90 degrees in the reverse direction, which is also given in the output as the "Rotate" value. You may be wondering where the OCR text is: --psm 0 does not perform OCR at all, it only performs orientation and script detection (OSD). In short, if you only need information about the text, --psm 0 is the mode to use. Now let's move toward the title of the blog.
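The OSD block that `--psm 0` prints is plain `key: value` text, so it is easy to consume programmatically. Below is an illustrative parser (not part of Tesseract); the sample string mirrors the fields Tesseract prints, with hypothetical confidence values.

```python
def parse_osd(osd_text):
    """Turn Tesseract's --psm 0 (OSD) output into a dictionary."""
    info = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip any line without a "key: value" shape
            info[key.strip()] = value.strip()
    return info

# Sample OSD output for a page rotated 270 degrees, as in Figure 4.
sample = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 2.28
Script: Latin
Script confidence: 1.10"""

osd = parse_osd(sample)
print(osd["Rotate"])  # degrees to rotate so the text is upright -> 90
print(osd["Script"])  # detected writing system -> Latin
```

With the "Rotate" value in hand, a pipeline can deskew the image automatically before running a text-extracting PSM on it.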
CASE 2: What we want is the text in the image from Figure 2, and that is not possible with PSM 0. Is there another choice? Yes: the next number is 1 - $ tesseract aboutcoditation_rotated.png stdout --psm 1
Figure 6: OCR text of Figure 2
Awesome, you have taken your first baby steps with the OCR engine. However, if you look at the output, there is no OSD information. Now let's take another step.
CASE 3: Tesseract's default PSM is 3, so if I use that one for Figure 2, will it give me some improvement? The answer is yes. Skilled practitioners are therefore supposed to start with PSM 3. Now let's take the simplest case.
CASE 4: A single-digit number, depicted in Figure 7. As we said, start with the default option --psm 3; unfortunately, the result is empty! So we need to experiment with other options, and testing with PSM 6, 7, 8, 9, 10, and 13 each gives the expected text. However, you are better off going with PSM 10, as per its description: treat the image as a single character.
Figure 7: One-digit number
$ tesseract 4.png stdout --psm 6
$ tesseract 4.png stdout --psm 7
$ tesseract 4.png stdout --psm 8
$ tesseract 4.png stdout --psm 9
$ tesseract 4.png stdout --psm 10
$ tesseract 4.png stdout --psm 13
$ tesseract 4.png stdout --psm 3
Figure 8: Result for Figure 7
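The trial-and-error above can be scripted rather than typed by hand. Here is a minimal sketch that builds and runs the same commands; the helper names are hypothetical, the image name `4.png` comes from the example above, and running `try_psms` requires the `tesseract` binary on your PATH.

```python
import subprocess

def tesseract_cmd(image_path, psm):
    """Build the tesseract command line for a given page segmentation mode."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]

def try_psms(image_path, modes=(3, 6, 7, 8, 9, 10, 13)):
    """Run tesseract once per PSM and collect whatever text each mode returns."""
    results = {}
    for psm in modes:
        out = subprocess.run(tesseract_cmd(image_path, psm),
                             capture_output=True, text=True)
        results[psm] = out.stdout.strip()
    return results

# Example: try_psms("4.png") would show PSM 3 returning an empty string
# while PSMs 6-10 and 13 return the digit, matching the case study above.
```

A loop like this makes it quick to compare modes on a new kind of input before committing to one PSM in a pipeline.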
These use cases will be discussed in more detail in my next blog. Stay Tuned!