Testing for Greater Accuracy in OCR
By Steve Anthony, Transym Computer Services

We thought some of our more technically minded readers might like to know more about how we train TOCR. We also present some accuracy figures for the different versions of TOCR, with the Lex option on & off, across a variety of data groups.

Testing & Data

OCR is a classic machine learning problem. Within Transym, we have developed software which allows TOCR to "learn" how to be more accurate from the data it is provided with. From this larger learning version we have created our production releases of TOCR which, as a subset of it, have been through exactly the same lengthy and rigorous training. The results are products which are not only robust but in which we have great confidence.

In order to make TOCR as accurate as possible, we have created or sourced a large set of images of different resolutions, sizes, sources & fonts, each with a text file containing what the image actually reads to a human. We call this verification data; it is also sometimes called ground truth data. Because of human error there is no certainty that the verification data is 100% accurate, and sometimes what poor-quality text reads is a matter of opinion.

It is not possible for us to release a version of TOCR that "learns" in a working environment. Training can only take place over very large datasets (which takes considerable time); otherwise TOCR would adapt itself to "local" conditions but lose accuracy in other areas. We do, however, welcome images from users to add to our database, especially if TOCR does not perform as well as hoped.

We sourced some of the images & verification data ourselves and took some from the ISRI database, available here: www.isri.unlv.edu/ISRI/OCRtk. This data was sufficient for the development and production of TOCR V1.4 (the data is predominantly English).
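As a rough illustration of how recognised text might be scored against verification data, the sketch below counts character errors as the edit distance between the OCR output and the verification text, then expresses the total as a percentage of the verification characters. This is an assumption made for illustration, not a description of Transym's own test software; the function names `edit_distance` and `char_error_percent` are hypothetical.

```python
def edit_distance(ocr_text: str, truth: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning ocr_text into truth."""
    prev = list(range(len(truth) + 1))
    for i, oc in enumerate(ocr_text, start=1):
        curr = [i]
        for j, tc in enumerate(truth, start=1):
            cost = 0 if oc == tc else 1
            curr.append(min(prev[j] + 1,          # drop a character from the OCR text
                            curr[j - 1] + 1,      # insert a missing character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def char_error_percent(pairs) -> float:
    """pairs: iterable of (ocr_text, verification_text), one pair per image.
    Returns total character errors as a percentage of verification characters."""
    errors = 0
    total = 0
    for ocr_text, truth in pairs:
        errors += edit_distance(ocr_text, truth)
        total += len(truth)
    if total == 0:
        raise ValueError("no verification characters to score against")
    return 100.0 * errors / total


# Hypothetical example: one substituted character in 14 verification characters.
pages = [("He1lo world", "Hello world"), ("OCR", "OCR")]
print(round(char_error_percent(pages), 2))
```

Because errors and characters are pooled before dividing, longer pages carry proportionally more weight in the result than short ones.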
Possibly uniquely, TOCR was and is trained by presenting it with real pages, testing the accuracy against the known verification data, and feeding this information back to the learning program. For V2.0, to properly cover the European language characters and the other characters from the full Windows character set, we needed additional data. We did two things to achieve this:
Our total image set now consists of over 53000 images totalling 1290 megabytes. The data is classified into training classes:
These classes have different characteristics and present different problems for TOCR. The real data presents TOCR with all the problems associated with scanning: skew, merged characters, broken characters, noise & sometimes impossible-to-read images. For English and European data we can, however, use lexical knowledge of the language to help identify the text, except where the characters are randomised. The manufactured data is in effect a "perfect scan" but still presents TOCR with difficulties to be solved when characters are randomised:
We now have 5 groups of data to be tested under 2 conditions, Lex on and Lex off, with version 1.4 and version 2.0 of TOCR. Lex on or off is a TOCR processing option available to the end user & programmer. As you might expect, using TOCR with Lex on for random data produces worse results than with Lex off, so these accuracy figures have been omitted. Additionally, it is not useful to test V1.4 on characters it has not been trained to recognise, so these accuracy figures have also been omitted. It is simpler to think in terms of the percentage of character errors rather than the percentage of characters correct.

Using TOCR with Lex on (the default)
Using TOCR with Lex off
The high error % for the line "English real data, Trained" is unsurprising, as it contains the very hardest-to-read images, including some we think are impossible. It is not representative of TOCR real-world accuracy. To get that, we would have to combine it with the "English real data, Untrained" line. Even then, we think the combined figure overestimates the errors, because we have always preferred difficult data for training, which biases the selection.

In summary:
We have tested some of the more popular competitor products on our data and have yet to find a more robust or accurate OCR engine under $1000.00. If there are any examples to the contrary that our user and partner communities can provide, we would welcome the chance to explore them as part of our continued commitment to improving the accuracy and reliability of our solutions.