How Machine Learning Optical Character Recognition Could Be Implemented
By
Aleksandr Solonskyi
Data Science Solution Architect at MobiDev
Have you ever faced challenges while building user-oriented digital security systems? Replacing the creation and maintenance of paperwork for numerous employees with a more efficient digital solution is certainly beneficial.
However, even in the era of Data Science and Machine Learning, reinventing security-related services is no easy task.
Let’s look at an approach to developing software solutions that use deep learning Optical Character Recognition (OCR) to extract text from US driver’s licenses and IDs.
OCR Is Typically a Machine Learning and Computer Vision Task
This technology began with the scanning of books and the recognition of typed text and handwritten digits (the NIST dataset). Detecting text “in the wild”, such as on road signs, license plates or outdoor advertising, is decidedly more difficult.
OCR is commonly used for optimization and automation. Some examples are checking test answers, real-time translation, recognizing street signs (Google Street View) and searching through photos (Dropbox). Each of these cases calls for a completely different OCR solution.
Choose OCR Technology After Clarifying Business Goals and Use Cases
Applying Machine Learning algorithms to biometric identification systems is one of the emerging AI trends. In our case, the client had an automated identity verification system that runs a comprehensive match, comparing an official document such as a driver’s license with a photo (selfie).
Data Used For Optical Character Recognition:
- Front-side photo of driver’s license for face detection and OCR verification
- Rear-side of ID to get barcode data: name, date of birth, etc.
- Photo to compare with ID photo
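In practice, the rear side of a US driver’s license carries a PDF417 barcode whose text payload follows the AAMVA standard, where each line starts with a three-letter element ID. The sketch below shows how such an already-decoded payload might be mapped to named fields; the field mapping is illustrative, and decoding the barcode image itself requires a separate library:

```python
# Illustrative AAMVA element IDs: DCS = family name, DAC = first name,
# DBB = date of birth. Real payloads carry many more fields and a header.
AAMVA_FIELDS = {"DCS": "last_name", "DAC": "first_name", "DBB": "dob"}

def parse_aamva(payload: str) -> dict:
    """Map the lines of a decoded PDF417 payload to named fields."""
    record = {}
    for line in payload.splitlines():
        code, value = line[:3], line[3:].strip()
        if code in AAMVA_FIELDS:
            record[AAMVA_FIELDS[code]] = value
    return record
```

The result is a dictionary of the name and date-of-birth fields that can then be compared with what OCR reads from the front of the document.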
During beta testing, it was discovered that users could readily fool the system. Some of them were sending photos of two different documents: an actual ID for the front side and a fake one, with an incorrect name and date of birth, for the back. Fraud attempts like these are something we also had to deal with. To solve this, a separate cross-checking module was designed to analyze and compare the information from both sides of an ID and verify that it matches.
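A cross-checking module of this kind can be sketched as a field-by-field comparison between the front-side OCR results and the rear-side barcode data. The field names and normalization below are assumptions for illustration, not the actual implementation:

```python
def normalize(value: str) -> str:
    """Strip whitespace and unify case before comparing."""
    return "".join(value.split()).upper()

def cross_check(front: dict, back: dict,
                fields=("first_name", "last_name", "dob")) -> bool:
    """Reject the document if any required field disagrees
    between the front-side OCR and the rear-side barcode."""
    return all(
        normalize(front.get(f, "")) == normalize(back.get(f, ""))
        for f in fields
    )
```

A document assembled from two different sources, such as a genuine front and a fake back, would fail this check as soon as any field disagrees.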
Another anti-fraud component contains a separate neural network that provides an anti-spoofing technique for face recognition.
Choosing the Optimal Machine Learning OCR Approach
Once all use cases were defined, Data Science and Machine Learning engineers started to explore existing OCR APIs and SDKs. There are numerous open-source, as well as commercial, ready-to-use solutions.
At first glance it seemed to be an easy process – take the best commercial system, process photos, check information, and presto! – it’s done. However, that proved to be too simplistic a solution, and didn’t work.
Security Versus User-Friendliness Of The OCR Engine
Selfies and ID photos were often taken in different conditions by different cameras. As a result, image quality and positioning varied widely.
Setting a high accuracy barrier to gain security resulted in an extremely high incidence of photo rejection, which proved inconvenient and annoying to users. The task then became to minimize rejection of legitimate access attempts while maintaining a high level of security.
In a step toward maximum accuracy, the variety of permissible photos taken by users was restricted. To help users satisfy the photo requirements, they were provided with a smooth UI/UX and clear instructions on how the photo should be taken.
Since we now had more or less standardized photos, we were able to set an optimal “security / user-friendliness” balance for the OCR. Many computer vision-based OCRs operate as “black boxes”, but we needed the raw data as well.
The Driver’s License Is A Complicated Document
Every state in the U.S. has its own document format, and the design changes every few years. The option of programming preset templates for the documents, therefore, does not exist. Also, the quality of photos on state-issued documents is low. Both of these conditions are important factors to consider when researching OCR solutions.
Be Aware of OCR System Imperfections
Being based on machine learning and computer vision, OCRs are subject to errors. Our challenge was to implement an effective combination of an OCR system and a cross-checking mechanism for low-quality photos, and produce a premises entry system which was both reliable and secure. Thus, both Software Engineering and Data Science skills were required.
Defining R&D Flow
The typical questions to ask before proceeding to AI solution development are: which data do I need, and how can I use the data I have at hand?
Data Understanding & Data Mining For An OCR
This is where the process begins. At this stage, we collected a relatively small dataset of real ID photos. You may be wondering whether the data understanding and mining steps are really needed when using an existing OCR. The answer is a definite yes.
The initial dataset contained around 100 driver’s licenses and 150 IDs. We used both open datasets and those we collected ourselves.
Research and Evaluation of OCRs With Machine and Deep Learning
As a first step, we compared available commercial and open-source OCR solutions. After shortlisting, we used our custom ID dataset to evaluate their performance on real data. The number of direct matches of names / surnames / dates of birth was chosen as the metric.
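This direct-match metric can be sketched as the share of documents whose key fields the OCR reproduced exactly; the field names below are illustrative:

```python
def direct_match_rate(predictions, ground_truth,
                      fields=("name", "surname", "dob")) -> float:
    """Fraction of documents where every key field matches exactly."""
    hits = sum(
        all(pred.get(f) == truth.get(f) for f in fields)
        for pred, truth in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)
```

Running this over the custom ID dataset for each shortlisted OCR gives a single comparable number per service.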
As a result, we decided to use Google Vision, which had the most accurate matching results (80% in both cases). The Google Vision service proved to be the most accurate and robust. The other solutions had various disadvantages:
- Complicated eDNA
- Uncontrolled security
- Unknown measurement units
- Inflexible third-party service
After Google, the ABBYY and Tesseract services ranked next in terms of performance. However, the performance of the Google service was almost twice as good.
The final evaluation and decision making should be done on the UberTesters data.
Why Not Try to Evaluate Using A Huge Open Dataset with Book Scans or Other Pictures with Texts?
This is a question of the data’s compliance with the task. Driver’s licenses are both complicated and unique in structure. What’s more, users may well provide low-quality pictures of those IDs. There is no doubt that many different OCRs would work perfectly with high-quality photos, or if the text were perfectly aligned. But reality did not present such perfection.
The evaluation needed to demonstrate actual performance in compliance with the real task. Therefore, the most relevant dataset was that which was collected according to our solution scenarios, by our target audience.
OCR Engine Implementation
Designing a solution for data security with an OCR component required a combination of Data Science, Machine Learning and Software Engineering. One challenge was to parse the raw data that Google Vision returned and compare it with information from barcodes with 100% accuracy. Even though Google has the best OCR system, it still makes mistakes. Those mistakes are acceptable in many cases, but not for security purposes. To solve this problem, we built an extra layer to prevent fraud attempts designed to exploit those OCR imperfections.
Extra Security Measures Using OCR Machine Learning Algorithms For MRZ
The Machine Readable Zone (MRZ) is the part of a travel document with well-defined (letter or number) data fields. It also contains check digits, parameters that confirm the data was read correctly.
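Check-digit validation for an MRZ field follows the ICAO Doc 9303 scheme: digits keep their value, letters A–Z map to 10–35, the filler “<” counts as 0, and the values are weighted cyclically by 7, 3, 1 and summed modulo 10. A minimal sketch:

```python
MRZ_WEIGHTS = (7, 3, 1)  # cyclic weights defined by ICAO Doc 9303

def mrz_char_value(ch: str) -> int:
    """Digits keep their value, A-Z map to 10-35, filler '<' is 0."""
    if ch.isdigit():
        return int(ch)
    if ch == "<":
        return 0
    return ord(ch.upper()) - ord("A") + 10

def mrz_check_digit(field: str) -> int:
    """Weighted sum of character values, modulo 10."""
    total = sum(mrz_char_value(c) * MRZ_WEIGHTS[i % 3]
                for i, c in enumerate(field))
    return total % 10
```

Comparing the computed digit with the one printed in the MRZ gives an independent signal that a field was read correctly.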
But even with Google Vision, the recognition accuracy level is far from 100%. It is for that reason that we designed an additional component for cross-checking. We analyzed the common errors that OCR tends to make when detecting data.
For example, the symbols “1 / I / i / l”, or “0 / O / Q / D”, look very similar and may be misinterpreted by the system. Gathering these error statistics allowed us to assist and correct the OCR if a mistake occurred.
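One way to sketch this kind of correction is to accept a character mismatch only when both characters fall in a known confusable group, checking the OCR output against a trusted reference such as the barcode data. The groups and helper below are illustrative; a real system would derive the groups from collected error statistics:

```python
# Illustrative confusable groups based on the examples above.
CONFUSABLE = {
    "1": "1IiLl", "I": "1IiLl", "i": "1IiLl", "l": "1IiLl",
    "0": "0OQD", "O": "0OQD", "Q": "0OQD", "D": "0OQD",
}

def fuzzy_match(ocr_text: str, reference: str) -> bool:
    """True if every OCR character equals the reference character
    or belongs to the same known confusable group."""
    if len(ocr_text) != len(reference):
        return False
    for ocr_ch, ref_ch in zip(ocr_text, reference):
        if ocr_ch != ref_ch and ref_ch not in CONFUSABLE.get(ocr_ch, ""):
            return False
    return True
```

With this, an OCR reading like “SM1TH” can still be matched against the barcode value “SMITH”, while a genuinely different name is rejected.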
Main Takeaways
To sum up our experience:
- Identify the business goals and use cases you would like to achieve with an AI solution. This greatly influences the approaches, architecture and tools to be used
- Understand your data. Data should be appropriate for the tasks in your project and be as real as possible
- Consider data quantity and sources. Once you understand the data, you can make an educated decision about whether open or commercial datasets best fit your business needs
- When building your own Data Science model and training a neural network, it’s better to use less data, but the most relevant data. Using huge datasets that do not accurately represent your particular project’s real data will not yield successful results
Posted with permission from MobiDev