“TOCR was by FAR the most accurate of the tools I tried.”
TOCR's main strength is its unrivalled accuracy. Because you can rely on the quality of the output, you can process more jobs – with less time spent correcting mistakes. This accuracy isn't the result of a few intensive research sessions – it's the product of a decade of comprehensive testing, analysis and improvement. TOCR 4.0 is now able to:
- Offer up to 99% accuracy across eleven European languages.
- Read broken, blurred and obscure characters.
- Draw on thousands of base images to interpret data.
- Suffer less downtime due to improved stability.
- Provide up to four suggested word alternatives for badly reproduced or damaged characters, making the batch checking and completion process as quick as possible.
We're confident that no other OCR engine has been put through the same rigorous and innovative development process.
How TOCR “learns”
In simple terms, TOCR's accuracy is the result of years of “training” the system to recognise a vast range of different images, characters and languages. Here's how it works:
OCR presents a classic machine learning problem - how to teach our engine to recognise difficult data.
Over a decade, we've amassed thousands of images – with different resolutions, fonts and sizes, each with a text file containing what the image actually reads to the human eye.
We call this verification data. Because of human error, there is no certainty that the verification data is 100% accurate (sometimes what poor-quality text actually says is a matter of opinion).
Possibly uniquely, TOCR is trained by presenting it with real pages, testing its accuracy against the known verification data, and then feeding this information back to the learning program.
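To make the idea of "testing the accuracy against the known verification data" concrete, here is a minimal sketch of one common way to score OCR output: count the character edits needed to turn the recognised text into the verification text, then express that as a fraction of the verification text's length. This is an illustration only, not TOCR's own scoring method; the function names `edit_distance` and `character_accuracy` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, verification_text: str) -> float:
    """Share of the verification text reproduced correctly (0.0 to 1.0).

    Assumption: accuracy is defined as 1 - errors / length of the
    verification text, clamped at zero. Other definitions are possible.
    """
    if not verification_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, verification_text)
    return max(0.0, 1.0 - errors / len(verification_text))
```

For example, `character_accuracy("he1lo world", "hello world")` counts one wrong character out of eleven. A score like this, computed for every page, is the kind of information that could be fed back to a learning program.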
Training takes place over very large datasets, which takes considerable time. Otherwise, TOCR would adapt itself to "local" conditions but lose accuracy in other areas.
From 2006 onwards, in order to properly cover other European languages, we repeated the process with a huge range of foreign characters.
Our total image set now consists of over 108,000 images totalling over 8 gigabytes of compressed data – and growing.
We categorise all our data into different areas – each presenting different problems for TOCR – including skewed, merged and broken characters, and sometimes impossible-to-read images.
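As a rough illustration of why categorising the data is useful, the sketch below averages per-image accuracy scores within each category, so that a gain in one area (say, skewed text) cannot hide a loss in another (say, broken characters). The category names, the scores and the function `accuracy_by_category` are made up for the example and do not describe TOCR's actual tooling.

```python
from collections import defaultdict
from statistics import mean


def accuracy_by_category(results):
    """Average accuracy per category.

    results: iterable of (category, accuracy) pairs, one per test image.
    Returns a dict mapping each category to its mean accuracy.
    """
    grouped = defaultdict(list)
    for category, accuracy in results:
        grouped[category].append(accuracy)
    return {category: mean(scores) for category, scores in grouped.items()}


# Hypothetical scores for a handful of test images.
sample_results = [
    ("skewed", 0.9),
    ("skewed", 0.7),
    ("merged", 0.6),
    ("broken", 0.5),
]

for category, score in sorted(accuracy_by_category(sample_results).items()):
    print(f"{category}: {score:.2f}")
```

Comparing such a per-category report before and after a training run is one simple way to spot an engine that has adapted to "local" conditions at the expense of other areas.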
TOCR now draws on so much real-life and created data that it is able to offer unparalleled OCR accuracy and reliability.
If you're interested in a more in-depth explanation of how our testing and development process works, visit our technical pages.
Want to find out more? Click on any of the following: