A little Alfresco / Tesseract OCR integration

I attended Alfresco DevCon in Berlin this year (a fantastic event) and two of the sessions that really caught my eye were given by Neil McErlean, Senior Software Developer, and Andy Hunt, Principal Support Engineer, about content transformations. I’d been playing around with an OCR tool called Tesseract for a bit on another project so thought I’d throw together a simple transformer. I was quite pleased with the results, so I thought I’d share them here in the hope that either someone finds the transforms themselves useful, or can use the code as a basis for writing their own transformer.

So what is Tesseract, then?

Tesseract is an open-source OCR tool. The project’s been around for a while, but has had a bit of a boost recently thanks to Google, who have been making quite a few sizeable commits. It’s available as a package on Ubuntu, and is pretty accurate. It’s also free, which is nice.

What does the integration do?

The integration makes Tesseract available for use as a transformer in Alfresco. This means that you can upload an image of some text (such as a scan of a form or letter) to Share, call a REST service and Alfresco will return a plain text version of that image. What’s _really_ neat, though, is that Alfresco is wired such that if it can turn a document (such as our image) into plain text (using our transformer) then it will automatically index the document using the result of the transformation. This means that, with the integration enabled, you can search for images in a Share document library based on text located in those images. I think this is quite cute, and a nice quality of life feature to be able to put in front of users.

Show me!

Here are a couple of screen-grabs:

First, you start off with an image to upload.  Something like this:

This is quite a difficult scan. Tesseract won’t be able to do a brilliant job of this, but we can still get a few keywords out.

Pick a couple of words out of the image.  Let’s go with “Association”, or maybe “furnishes”.

Next, upload it to the document library:

Lastly, search for a word in the document.  In this case, “Association”:

Hooray!  The document we just loaded has been returned from our search.

How does it work?

Alfresco makes it really simple to set one of these things up. You can get something primitive working without writing any code at all, but I chose to implement an instance of ContentTransformerHelper (an Alfresco provided common superclass for this sort of thing) as this gives you better control over what happens during transformation and better handling of exceptional conditions. Once you’ve written a single simple java class, all you need is a tiny bit of Spring to register the transformer with Alfresco and you’re done!
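To give a flavour of what the core of such a transformer has to do, here is a minimal sketch of the step that shells out to Tesseract. It is not the code from the repository and deliberately leaves out the ContentTransformerHelper subclass and the Spring registration. The class name TesseractSketch, the "eng" language setting and the 60-second timeout are all assumptions made for illustration. It relies only on the standard command-line form, tesseract <image> <outputbase> -l <lang>, where Tesseract itself appends ".txt" to the output base.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs the tesseract command-line tool on one image
// and returns the recognised text. Requires Java 11+ and tesseract on the PATH.
public class TesseractSketch {

    // Builds the command line: tesseract <image> <outputBase> -l <language>.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs tesseract and reads back the text file it writes.
    // Throws IOException on timeout or non-zero exit so the caller can fail the transformation.
    static String ocr(Path image, String language, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("result");
        Path outputFile = workDir.resolve("result.txt"); // tesseract adds the ".txt" suffix
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore console chatter
                    .start();
            // OCR can be slow, so give up after a fixed time rather than block forever.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return Files.readString(outputFile, StandardCharsets.UTF_8);
        } finally {
            // Best-effort cleanup of the temporary output file and directory.
            Files.deleteIfExists(outputFile);
            Files.deleteIfExists(workDir);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TesseractSketch.java <image>");
            return;
        }
        System.out.println(ocr(Path.of(args[0]), "eng", 60));
    }
}
```

In a real transformer, the image would come from the content reader Alfresco hands you and the resulting text would be written to the content writer. Throwing on a timeout or a non-zero exit code is one way of letting the transformation fail cleanly rather than hang.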

Rather than go into any further detail, I’ve uploaded the code to Google Code so you can have a look yourself. If there are any mistakes you’d like to fix, or any changes you’d like to make, just let me know!

How can I get this working myself?

It’s easy.

Firstly, get yourself a server. I’ve been testing this on an Ubuntu EC2 instance, size c1.medium.

Secondly, get yourself Alfresco. Simply download the community installer, run it and answer “Yes” to everything. After Alfresco has started up, test it out for a bit and shut it down again.

Thirdly, get yourself Tesseract. On Ubuntu, a simple apt-get install tesseract-ocr will do the trick. Other distributions and OSes will vary; see the Tesseract homepage for further details.

Finally, install the integration. All you need to do is download the binary amp file, place it in the [alfresco_root]/amps directory, run the [alfresco_root]/bin/apply_amps.sh script, start Alfresco and you’re done.

Are there any drawbacks?

It’s OCR, so “Yes”. While Tesseract is great, it’s still a long way from perfect accuracy. In addition, OCR on a long document can be a time-consuming and expensive process. The Alfresco transformation engine is quite good at queuing and prioritising transformations against other work in a way that won’t harm the user experience too much, and very good at allowing transformations to fail gracefully, but at the end of the day if you’re uploading a lot of images on comparatively modest hardware, you will notice a substantial drop in performance.

Additionally, the code I’m linking to is something I wrote for my own personal satisfaction. I’m not using it in production, and neither should you without taking the time to experiment and understand what trade-offs you’re making.

TL;DR

The Alfresco transformation engine is really cool. It’s really simple to create a new transformer, and if you write a transformer that converts a given file type to plain text, you get search indexing for free. There’s a little OCR integration I knocked together with Tesseract over here so you can see what I mean.

Thanks for reading.

Written by: Simon White