

This article was written in 2007 and covers TOCR Versions 1.4 and 2.0. These versions have since been superseded by Version 3.3 Pro. Please visit our version comparison page, where you will find information on the differences between the versions. For information on how the speed switches work in TOCR 3.3 and TOCR 4.0, please visit our speed versus accuracy page.

Our mission is to produce the most reliable and accurate OCR engine on the market, and TOCR Version 4.0 is our most accurate engine to date. We are currently working on Version 5.0. The technical pages will be updated following the release of TOCR Version 5.0.

Testing for Greater Accuracy in OCR

By Steve Anthony, Transym Computer Services

We thought some of our more technically minded readers might be interested to know more about how we train TOCR and to see some accuracy figures for the different versions of TOCR with the Lex option on and off, and on a variety of data groups.

Testing

OCR is a classic machine learning problem.

Within Transym, we have developed software which allows TOCR to "learn" how to be more accurate from the data it is provided with.

From this larger, learning-enabled version we have created our production releases of TOCR which, as a subset, have been through exactly the same lengthy and rigorous training. The results are products that are not only robust but in which we have great confidence.

In order to make TOCR as accurate as possible, we have created or sourced a large set of images of different resolutions, sizes, sources and fonts, each with a text file containing what the image actually reads to a human. We call this verification data; it is also sometimes called ground truth data. Because of human error there is no certainty that the verification data is 100% accurate, and sometimes it is a matter of opinion what poor-quality text reads.
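As a minimal sketch of how such verification data can be organised, the following assumes each image sits beside a text file of the same name holding what the page reads to a human; the layout and file extension are illustrative choices for the example, not our actual database format:

    from pathlib import Path

    def load_verification_pairs(root):
        """Yield (image_path, verification_text) pairs.

        Assumes each scanned image (e.g. page001.tif) sits beside a text
        file of the same name (page001.txt) holding what the page reads
        to a human. The layout is illustrative only.
        """
        for image_path in sorted(Path(root).glob("*.tif")):
            truth_path = image_path.with_suffix(".txt")
            if truth_path.exists():
                yield image_path, truth_path.read_text(encoding="utf-8")

    # For example, count the pages and verified characters in a set:
    pairs = list(load_verification_pairs("verification_data"))
    print(len(pairs), "pages,", sum(len(t) for _, t in pairs), "characters")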

It is not possible for us to release a version of TOCR that "learns" in a working environment. Training can only take place over very large datasets (which takes considerable time), otherwise TOCR would adapt itself to "local" conditions but lose accuracy in other areas.

We do however welcome images from users to add to our database, especially if TOCR does not perform as well as hoped for.

Some of the images and verification data we have sourced ourselves, and some are from the ISRI database.

This data was sufficient for the development and production of TOCR Version 1.4 (the data is predominantly English).

Possibly uniquely, TOCR was and is trained by presenting it with real pages, testing the accuracy against the known verification data, and feeding this information back to the learning program.
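The cycle can be sketched as follows; engine.recognise, engine.learn_from and the count_errors callback are hypothetical stand-ins for illustration, not TOCR's internal interface:

    def training_epoch(engine, pages, count_errors):
        """One pass of the train-test-feedback cycle: recognise each real
        page, score the output against the known verification text, and
        feed the result back to the learning program."""
        total_chars = total_errors = 0
        for image, truth in pages:
            recognised = engine.recognise(image)        # run the engine on a real page
            errors = count_errors(recognised, truth)    # test against verification data
            engine.learn_from(image, truth, errors)     # the feedback step
            total_errors += errors
            total_chars += len(truth)
        return 100.0 * total_errors / total_chars       # error % for this pass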

For Version 2.0 to properly cover the European language characters and the other characters from the full Windows character set, we needed additional data. We did two things to achieve this:

  • We took non-English text from a variety of sources and manufactured images of that text in a variety of fonts.

  • Since some characters occur very infrequently, we manufactured verification data by randomising equal numbers of each character, and then manufactured images of the random characters in a variety of fonts and italic/bold combinations. Each page uses a single font, point size and style, but a wide variety of pages were created in different fonts and styles (see the sketch below).
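The sketch below shows how such a page might be manufactured, using Pillow to render shuffled characters in a single font, point size and style while recording the matching verification text. The library, font handling and page dimensions are assumptions made for the example, not the tooling we actually used:

    import random
    from PIL import Image, ImageDraw, ImageFont

    def make_random_page(charset, font_path, point_size=24,
                         width=2480, height=3508, margin=100):
        """Render a page of randomised characters, all in one font, size
        and style, and return (image, verification_text)."""
        chars = list(charset) * 40                     # equal numbers of each character
        random.shuffle(chars)
        page = Image.new("L", (width, height), 255)    # white A4 page at ~300 dpi
        draw = ImageDraw.Draw(page)
        font = ImageFont.truetype(font_path, point_size)
        x, y = margin, margin
        line_height = int(point_size * 1.5)
        lines, line = [], []
        for ch in chars:
            token = ch + " "
            w = draw.textlength(token, font=font)
            if x + w > width - margin:                 # wrap to the next line
                lines.append("".join(line).rstrip())
                line, x, y = [], margin, y + line_height
                if y + line_height > height - margin:
                    break                              # the page is full
            draw.text((x, y), ch, font=font, fill=0)   # draw one black character
            line.append(token)
            x += w
        if line:
            lines.append("".join(line).rstrip())
        return page, "\n".join(lines)

    page, truth = make_random_page("abcXYZ012", "arial.ttf")   # font path varies by system
    page.save("random_page.png")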

Our total image set now consists of over 100,000 images. (*This is correct as of 25/06/2012 - the number of images is always growing. We will update this page occasionally to reflect this.)

The data is classified into training classes:

  • English real data: scanned images of magazines, scientific reports, letters etc. We have quite a lot of this data, so we split it into two parts:

      ◦ the really difficult images, used for training;

      ◦ the less difficult images, which we put to one side as a check that we were not overtraining.

  • European language data: real text from websites and other sources, but in the main the images are manufactured. This was used in training TOCR.

  • English manufactured data: randomised characters on a page in the same font, different pages having a wide variety of fonts, both regular and italic. We did not need to use this in training TOCR.

  • TOCR Version 2.0 recognisable character set ("European" for short) manufactured data: randomised characters on a page in the same font, different pages having a wide variety of fonts, both regular and italic. This was used in training TOCR.


These classes have different characteristics and present different problems for TOCR.

The real data presents TOCR with all the problems associated with scanning: skew, merged characters, broken characters, noise and sometimes impossible-to-read images.

However, for English and European data we can use lexical knowledge of the language to help identify the text, except where it has been randomised.

The manufactured data is in effect a "perfect scan", but it still presents TOCR with difficulties to be solved when the characters are randomised:

  • No use can be made of any lexical knowledge.

  • Some glyphs within the same font are identical. For example, in the Arial font, the I (character code 73) and the l (character code 108) appear identical (see the sketch after this list).

  • Only the shape, size and line position of the character can really help identify it.
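The identical-glyph point can be checked by rasterising both characters and comparing the bitmaps. Pillow is our illustrative choice here, and the path to Arial is system-dependent:

    from PIL import Image, ImageDraw, ImageFont

    def render(ch, font, size=(64, 96)):
        """Rasterise a single character as a monochrome bitmap."""
        img = Image.new("L", size, 255)
        ImageDraw.Draw(img).text((8, 8), ch, font=font, fill=0)
        return img.tobytes()

    arial = ImageFont.truetype("arial.ttf", 48)        # adjust the path for your system
    print(render("I", arial) == render("l", arial))    # True where the two glyphs match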


We now have 5 groups of data, to be tested under 2 conditions: Lex On and Lex Off, with Version 1.4 and Version 2.0 of TOCR. Lex On or Off is a TOCR processing option available to the end user and programmer.
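As a toy illustration of what Lex On buys (the general idea only, not TOCR's actual algorithm): when shape matching leaves several candidates for a character, a reading that forms a known word can be preferred over one that does not:

    from itertools import product

    LEXICON = {"code", "cool"}      # a toy word list; a real lexicon is far larger

    def resolve_word(candidates_per_position):
        """Choose between per-character shape candidates, preferring a
        reading that forms a known word."""
        for chars in product(*candidates_per_position):
            word = "".join(chars)
            if word.lower() in LEXICON:
                return word
        # No reading is a known word: fall back to the best shape matches.
        return "".join(chars[0] for chars in candidates_per_position)

    # Shape alone cannot separate "0" from "o" here; the lexicon can:
    print(resolve_word([["c"], ["0", "o"], ["d"], ["e"]]))   # prints "code"

This is also why Lex cannot help with randomised pages: no sequence of random characters forms a word, so the fallback is all that remains.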

As you might expect, using TOCR with Lex On for random data produces worse results than with Lex Off, so these accuracy figures have been omitted.

Additionally, it is not useful to test Version 1.4 on characters it has not been trained to recognise, so those accuracy figures have also been omitted.

It is simpler to think in terms of percentage of character errors rather than percentage of characters correct.
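As an illustration of the metric, the sketch below counts character errors with edit distance (a standard choice; our own test harness may count differently) and expresses them as a percentage of the maximum possible:

    def levenshtein(a, b):
        """Edit distance: the minimum number of single-character
        insertions, deletions and substitutions turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def error_percent(recognised, truth):
        """Character errors as a percentage of the maximum possible."""
        return 100.0 * levenshtein(recognised, truth) / len(truth)

    print(round(error_percent("he1lo w0rld", "hello world"), 2))   # 18.18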

Data

Using TOCR with Lex On (the default)

 

                                    Max Possible   V1.4 Errors   V2.0 Errors   V1.4 err %   V2.0 err %
English real data, Trained            13,145,982       205,426       199,070         1.56         1.51
English real data, Untrained          13,811,869        21,091        17,909         0.15         0.13
European manufactured data, Trained    9,773,310             -        15,271            -         0.16



Using TOCR with Lex Off

 

                                    Max Possible   V1.4 Errors   V2.0 Errors   V1.4 err %   V2.0 err %
English real data, Trained            13,145,610       502,413       357,467         3.82         2.72
English real data, Untrained          13,811,869       244,533        90,035         1.77         0.65
English randomised data, Untrained     2,309,032        73,553        42,005         3.19         1.82
European manufactured data, Trained    9,773,310             -        43,331            -         0.44
European randomised data, Trained     10,265,900             -       252,775            -         2.46


The high error % for the "English real data, Trained" line is unsurprising, as it contains the very hardest-to-read images, including some we think are impossible. It is not representative of TOCR's real-world accuracy.

To get that, we would have to combine it with the "English real data, Untrained" line. Even then, we think it overestimates the errors: we have always preferred difficult data for training, thus biasing the selection.

In summary:

Version 2.0 is an accuracy improvement on Version 1.4, even though it has more scope to get things wrong, i.e. it recognises a wider range of characters.

Lex should be On where appropriate, as this improves accuracy considerably.

We have tested some of the more popular competitor products on our data and have yet to find a more robust or accurate OCR engine under $1000.00. If our user and partner communities can provide examples to the contrary, we would welcome the chance to explore them as part of our continued commitment to improving the accuracy and reliability of our solutions.