Optical Character Recognition (OCR): PyTesseract vs. EasyOCR

Dr. Marco Berta
2 min read · Feb 13, 2021


receipt OCR

Text extraction from images is becoming one of the most common applications of artificial intelligence. You meet it at the very beginning of your deep learning coding career, when you confront the MNIST dataset and read about convolutional neural networks (CNNs). As you advance, you will need to apply these basic concepts to more sophisticated solutions, such as a script that automatically reads the plate number of a car [1]. But you don’t always need to build a CNN from scratch. These functions are offered as services by the main cloud providers, and free open-source libraries can be included in a Python script.

When I received the assignment of detecting text and digits to extract expenses from a bill, I tested two popular OCR libraries that run flawlessly in a Google Colab notebook. I downloaded a sample of 200 receipt images to the Colab virtual machine using the commands

!wget https://expressexpense.com/large-receipt-image-dataset-SRD.zip  # download from source
!unzip large-receipt-image-dataset-SRD.zip -d /content/invoice_data/  # extract to a custom subfolder "invoice_data"
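Once the archive is extracted, the images have to be located before they can be passed to an OCR engine. A minimal sketch, assuming the receipts are JPEG or PNG files somewhere under the `/content/invoice_data/` folder used above; the internal layout of the archive is not described here, so the search is recursive, and `list_receipt_images` is a hypothetical helper name:

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def list_receipt_images(folder):
    """Return all image files below `folder`, sorted by path."""
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )

# Example call in the notebook (path taken from the unzip command):
# images = list_receipt_images("/content/invoice_data/")
```

The suffix check is case-insensitive, so files named `.JPG` are picked up as well.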

Then I imported OpenCV for loading and preprocessing the image,

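# With THRESH_OTSU set, the threshold is chosen automatically and the 20 is ignored; `image` must be 8-bit grayscale
_, img_bin = cv2.threshold(image, 20, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
# Invert the binary image (black pixels become white and vice versa)
thresh = cv2.bitwise_not(img_bin)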

to be analyzed by PyTesseract. The results weren’t always accurate, so I looked for a similar Python library to check whether they could be improved: EasyOCR. EasyOCR is built on the PyTorch library, and having a GPU speeds up the whole detection process. This is not an issue, as a GPU runtime can be used for free in Google Colab. As shown below in Figure 1, more characters were detected, and with higher accuracy. In this case no manual image preprocessing was necessary, since it is done automatically, but a language has to be specified. You can select several among 58, and I simply chose English [2].
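The two approaches described above can be put side by side in code. What follows is a minimal sketch, not the code from the original notebook: the function names and the `min_confidence` cutoff are hypothetical, the binarized (non-inverted) image is passed to Tesseract as an assumption, and it presumes `opencv-python`, `pytesseract` (with the Tesseract binary) and `easyocr` are installed. The third-party imports sit inside the functions so that the pure-Python helper can be used without them.

```python
def ocr_with_pytesseract(image_path):
    """Binarize the image with Otsu's method, then run Tesseract on it."""
    import cv2
    import pytesseract

    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # Otsu needs 8-bit grayscale
    if image is None:
        raise FileNotFoundError(image_path)
    # With THRESH_OTSU the threshold is computed from the histogram (the 0 is ignored)
    _, img_bin = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return pytesseract.image_to_string(img_bin, lang="eng")


def join_easyocr_results(results, min_confidence=0.0):
    """Join the text of detections given as (bbox, text, confidence) tuples."""
    return "\n".join(
        text for _bbox, text, conf in results if conf >= min_confidence
    )


def ocr_with_easyocr(image_path, languages=("en",), use_gpu=True):
    """Run EasyOCR on the raw image; no manual preprocessing is applied."""
    import easyocr

    reader = easyocr.Reader(list(languages), gpu=use_gpu)  # loads the models
    results = reader.readtext(image_path)  # list of (bbox, text, confidence)
    return join_easyocr_results(results)
```

Calling `ocr_with_pytesseract(path)` and `ocr_with_easyocr(path)` on the same receipt yields two strings that can be compared directly. Passing `use_gpu=False` runs EasyOCR on the CPU, only more slowly.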

Fig. 1 PyTesseract vs. EasyOCR
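A visual comparison like the one in Figure 1 can be complemented with a number. The sketch below is an illustration under stated assumptions rather than the evaluation used in this post: it presumes a hand-typed ground-truth transcription is available for a receipt, and scores each OCR output with the similarity ratio from Python's standard `difflib` module. The helper names, the engine labels and the sample strings are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace, so layout differences are not penalized."""
    return " ".join(text.lower().split())


def ocr_similarity(ocr_text, ground_truth):
    """Return a similarity score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(ground_truth)).ratio()


# Made-up strings standing in for the output of two OCR engines
truth = "TOTAL 12.50"
candidates = {"engine A": "T0TAL 12.5O", "engine B": "TOTAL  12.50\n"}
for name, text in candidates.items():
    print(f"{name}: {ocr_similarity(text, truth):.2f}")
```

Averaging this score over a set of receipts would give a rough, reproducible way to rank the two libraries, at the cost of transcribing a few receipts by hand.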

Bottom line: EasyOCR was the winner for today, with the minor downside of requiring a GPU-accelerated machine by default.

References

[1] https://jideilori.medium.com/ocr-with-machine-learning-55c7d082fe78

[2] https://www.pyimagesearch.com/2020/09/14/getting-started-with-easyocr-for-optical-character-recognition/

The full code can be found at https://github.com/opsabarsec/Receipts-OCR-on-colabs

Written by Dr. Marco Berta

Senior Data Scientist @ ZF Wind Power, Ph.D. Materials Science in Manchester University
