Testing for Greater Accuracy in OCR

By Steve Anthony, Transym Computer Services

For our more technically minded readers, this article explains how we train TOCR and presents accuracy figures for the different versions of TOCR, with the LEX option on and off, across a variety of data groups.

Testing & Data

OCR is a classic machine learning problem.

Within Transym, we have developed software which allows TOCR to "learn" how to be more accurate from the data it is provided with.

From this larger version we have created our production releases of TOCR which, as a subset of it, have been through exactly the same lengthy and rigorous training.  The results are products that are not only robust but in which we have great confidence.

To make TOCR as accurate as possible, we have created or sourced a large set of images of different resolutions, sizes, sources and fonts, each with a text file containing what the image actually reads to a human.  We call this verification data; it is also sometimes called ground truth data.  Because of human error, there is no certainty that the verification data is 100% accurate, and what poor-quality text reads is sometimes a matter of opinion.

It is not possible for us to release a version of TOCR that "learns" in a working environment.  Training can only take place over very large datasets (which takes considerable time); otherwise TOCR would adapt itself to "local" conditions but lose accuracy in other areas.

We do, however, welcome images from users to add to our database, especially where TOCR does not perform as well as hoped.

We sourced some of the images and verification data ourselves and took some from the ISRI database, available here: www.isri.unlv.edu/ISRI/OCRtk

This data was sufficient for the development and production of TOCR V1.4 (the data is predominantly English).

Possibly uniquely, TOCR was and is trained by presenting it with real pages, testing its output against the known verification data, and feeding the results back to the learning program.
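As an illustration of testing OCR output against verification data, the sketch below counts character errors as the edit distance (insertions, deletions and substitutions) between the two texts. This is an assumption made for illustration: the article does not say how TOCR's errors are actually counted, and the function name `char_errors` is hypothetical.

```python
def char_errors(ocr_text: str, truth: str) -> int:
    """Levenshtein distance: the number of single-character insertions,
    deletions and substitutions needed to turn ocr_text into truth."""
    # prev[j] holds the distance between the OCR text read so far
    # and the first j characters of the verification text.
    prev = list(range(len(truth) + 1))
    for i, a in enumerate(ocr_text, start=1):
        curr = [i]
        for j, b in enumerate(truth, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from the OCR text
                            curr[j - 1] + 1,      # insert b into the OCR text
                            prev[j - 1] + cost))  # substitute a with b (or match)
        prev = curr
    return prev[-1]


# One misread character ("h" read as "b") counts as one error.
print(char_errors("Tbe quick brown fox", "The quick brown fox"))
```

A count like this, summed over every page in a data group, is the kind of figure that could be fed back to a learning program.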

For V2.0 to properly cover the European language characters and the other characters from the full Windows character set, we needed additional data.  We did two things to achieve this:

  • From a variety of sources we took foreign-language text and manufactured images of that text in a variety of fonts.
  • Since some characters occur very infrequently, we manufactured verification data by randomising equal numbers of each character and then manufactured images in a variety of fonts and italic/bold combinations.
Each manufactured page uses a single font, point size and style, but pages were created in a wide variety of fonts and styles.
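The second step, randomising equal numbers of each character, can be sketched as follows. This is a minimal illustration, not the tool we used: the function name, the `line_length` and `seed` parameters, and the tiny sample character set are all assumptions, and rendering the text to an image in a chosen font is left out.

```python
import random
from typing import Optional


def randomised_page(charset: str, copies: int,
                    line_length: int = 60,
                    seed: Optional[int] = None) -> str:
    """Return verification text in which every character of `charset`
    appears exactly `copies` times, in random order."""
    chars = list(charset) * copies        # equal numbers of each character
    random.Random(seed).shuffle(chars)    # randomise their order
    # Break the shuffled characters into fixed-length lines of text.
    lines = ["".join(chars[i:i + line_length])
             for i in range(0, len(chars), line_length)]
    return "\n".join(lines)


# Hypothetical sample character set; a real run would use the full set.
page = randomised_page("AEIlÉßñø", copies=5, line_length=10, seed=1)
print(page)
```

Because every character appears equally often, rare characters get as much coverage as common ones.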

Our total image set now consists of over 53000 images totalling 1290 megabytes.

The data is classified into training classes:

  1. English real data: scanned images of magazines, scientific reports, letters etc.
    We have quite a lot of this data, so we split it into two parts:
    1. The really difficult images, to be used for training.
    2. The less difficult images, which we put to one side as a check that we were not overtraining.
  2. European language data: real text from websites and other sources, but in the main the images are manufactured.  This was used in training TOCR.
  3. English manufactured data: randomised characters on a page in the same font, with different pages using a wide variety of fonts, both regular and italic.  We did not need to use this in training TOCR.
  4. Manufactured data covering the TOCR V2.0 recognisable character set ("European" for short): randomised characters on a page in the same font, with different pages using a wide variety of fonts, both regular and italic.  This was used in training TOCR.

These classes have different characteristics and present different problems for TOCR.

The real data presents TOCR with all the problems associated with scanning: skew, merged characters, broken characters, noise and sometimes impossible-to-read images.

For English and European data we can, however, use lexical knowledge of the language to help identify the text, except where it is randomised.

The manufactured data is in effect a "perfect scan" but still presents TOCR with difficulties to be solved when characters are randomised:
  • No use can be made of any lexical knowledge.
  • Some glyphs within the same font are identical.  For example, in the Arial font the I (character code 73) and the l (character code 108) appear identical.
  • Only the shape, size and line position of the character can really help identify it.
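The character codes quoted above can be checked in a couple of lines of Python (used here purely for illustration):

```python
# Upper-case I and lower-case l are distinct characters with distinct
# codes, even when a font such as Arial draws them with the same glyph.
codes = {ch: ord(ch) for ch in ("I", "l")}
print(codes)  # {'I': 73, 'l': 108}
```

The codes differ, but nothing on a randomised page tells the two characters apart when the glyphs match.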

We now have 5 groups of data, to be tested under 2 conditions, Lex on and Lex off, with version 1.4 and version 2.0 of TOCR.  Lex on or off is a TOCR processing option available to the end user and programmer.

As you might expect, using TOCR with Lex on with random data produces worse results than with Lex off, so these accuracy figures have been omitted.
Additionally, it is not useful to test V1.4 on characters it has not been trained to recognise, so these accuracy figures have also been omitted.

It is simpler to think in terms of percentage of character errors rather than percentage of characters correct.
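The error percentage is simply the number of character errors divided by the maximum possible (the number of characters in the verification data), times 100. A minimal sketch follows, using the V1.4 and V2.0 Lex off figures for the trained English real data from the tables that follow; the function name is illustrative only.

```python
def error_percent(errors: int, max_possible: int) -> float:
    """Character errors as a percentage of the characters in the
    verification data."""
    return 100.0 * errors / max_possible


max_possible = 13145610  # English real data, Trained (Lex off)
for version, errors in (("V1.4", 502413), ("V2.0", 357467)):
    err = error_percent(errors, max_possible)
    print(f"{version}: {err:.2f}% errors, {100.0 - err:.2f}% correct")
```

Stated as errors, the improvement from V1.4 to V2.0 is easier to see than when the same figures are stated as percentages correct.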

Using TOCR with Lex on (the default)
Data group                           | Max Possible | V1.4 Errors | V2.0 Errors | V1.4 err % | V2.0 err %
English real data, Trained           |     13145982 |      205426 |      199070 |       1.56 |       1.51
English real data, Untrained         |     13811869 |       21091 |       17909 |       0.70 |       0.13
European manufactured data, Trained  |      9773310 |           - |       15271 |          - |       0.16


Using TOCR with Lex off
Data group                           | Max Possible | V1.4 Errors | V2.0 Errors | V1.4 err % | V2.0 err %
English real data, Trained           |     13145610 |      502413 |      357467 |       3.82 |       2.72
English real data, Untrained         |     13811869 |      244533 |       90035 |       1.77 |       0.65
English randomised data, Untrained   |      2309032 |       73553 |       42005 |       3.19 |       1.82
European manufactured data, Trained  |      9773310 |           - |       43331 |          - |       0.44
European randomised data, Trained    |     10265900 |           - |      252775 |          - |       2.46

The high error % for the line "English real data, Trained" is unsurprising, as it contains the very hardest-to-read images, including some we think are impossible.  It is not representative of TOCR's real-world accuracy.
To get that, we would have to combine it with the "English real data, Untrained" line.  Even then, we think it overestimates the errors, because we have always preferred difficult data for training, thus biasing the selection.
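The combination can be worked through directly from the two "English real data" lines. The sketch below pools errors and character counts for V2.0, using the figures as read from the tables above. Simple pooling of the counts is an assumption about how the two lines would be combined, and the names are illustrative only.

```python
def pooled_error_percent(groups):
    """groups: list of (max_possible, errors) pairs.
    Returns total errors as a percentage of total characters."""
    total_chars = sum(max_possible for max_possible, _ in groups)
    total_errors = sum(errors for _, errors in groups)
    return 100.0 * total_errors / total_chars


# (Max Possible, V2.0 Errors) for the Trained and Untrained English real data
v2_lex_on = [(13145982, 199070), (13811869, 17909)]
v2_lex_off = [(13145610, 357467), (13811869, 90035)]

print(f"V2.0, Lex on:  {pooled_error_percent(v2_lex_on):.2f}%")
print(f"V2.0, Lex off: {pooled_error_percent(v2_lex_off):.2f}%")
```

On these figures the pooled V2.0 error rate is about 0.8% with Lex on and about 1.7% with Lex off, which, as noted, still reflects a selection biased towards difficult images.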


In summary:

  • Version 2.0 is more accurate than version 1.4, even though it has more scope to get things wrong, i.e. it recognises a wider range of characters.
  • Lex should be on where appropriate, as this improves accuracy considerably.

We have tested some of the more popular competitor products on our data and have yet to find a more robust or accurate OCR engine under $1000.00.  If there are any examples to the contrary that our user and partner communities can provide, we would welcome the chance to explore them as part of our continued commitment to improving the accuracy and reliability of our solutions.

Copyright © Transym, 2007. All Rights Reserved