OCR For Java

As far as I know, there are no pure Java OCR solutions available at the moment, either open source or commercial. It still seems like a distant dream to expect Java to do complex image analysis and extract text from an image. Some solutions use JNI and native binaries to get this done, so they are not pure Java.

There is a fairly popular solution from Asprise available on the internet. I was not impressed with it. It can't recognize text very well, and most of the time all you get is junk characters. If anyone else has had better luck, let me know.

I would recommend Tesseract OCR, which is open source and maintained by people at Google. The code is hosted on the Google Code site. It recognizes text pretty well and is being consistently improved.

Integration with Java has to go through JNI or Runtime calls to the executables. It is worth the effort if you want a free tool; otherwise, go for a commercial product such as IRIS or ABBYY.
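As a rough idea of what the "Runtime calls" route could look like, here is a minimal sketch. It assumes the tesseract executable is on the PATH and that it is invoked as "tesseract <image> <outputbase> [-l <lang>]", writing the recognized text to <outputbase>.txt. The class name, method names and the "ocr-output" base name are made up for illustration and are not part of Tesseract.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shells out to the tesseract executable.
final class TesseractRunner {

    // Builds the argument list: tesseract <image> <outputBase> [-l <lang>]
    static List<String> buildCommand(String image, String outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");          // assumption: executable is on the PATH
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        return cmd;
    }

    // Runs tesseract and returns the contents of <outputBase>.txt.
    static String recognize(String image, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO()           // pass tesseract's console output through
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        byte[] text = Files.readAllBytes(Paths.get(outputBase + ".txt"));
        return new String(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java TesseractRunner <image>");
            return;
        }
        System.out.println(recognize(args[0], "ocr-output", "eng"));
    }
}
```

ProcessBuilder is used here instead of Runtime.exec because it takes the arguments as a list, so file names containing spaces need no manual quoting.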

Some additional info on Tesseract

The Tesseract OCR engine was one of the top three engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. It reads a binary, grey or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images, and libtiff can be added to read compressed ones.
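Since the built-in reader only handles uncompressed TIFF, it can help to convert images before handing them to Tesseract. The following is only a sketch, assuming Java 9 or later, where the standard javax.imageio package ships a TIFF writer; on older JREs a separate TIFF plugin would be needed. The class and method names are invented for this example.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper: re-encodes an image as TIFF with compression disabled.
final class TiffConverter {

    static byte[] toUncompressedTiff(BufferedImage image) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            // Assumption: happens on JREs without a TIFF writer (before Java 9).
            throw new IOException("no TIFF writer available in this JRE");
        }
        ImageWriter writer = writers.next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_DISABLED); // write raw pixel data

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java TiffConverter <input-image> <output.tif>");
            return;
        }
        BufferedImage image = ImageIO.read(new File(args[0]));
        if (image == null) {
            throw new IOException("unsupported image format: " + args[0]);
        }
        Files.write(Paths.get(args[1]), toUncompressedTiff(image));
    }
}
```

The resulting file can then be passed to the tesseract executable like any other uncompressed TIFF.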

The latest release (December 6, 2007) supports many languages and is quite powerful.

Links

Code, Download & Project Info: http://code.google.com/p/tesseract-ocr/

A Java GUI for Tesseract:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/58fcfab8ef3ae7c1
http://sourceforge.net/project/showfiles.php?group_id=153105&package_id=253235&release_id=562073

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License