One issue that a large volume of scanning creates is attaching index information to each scanned document for storage. The challenge is deciding whether to rely on manual index entry by human operators or to build a more automated workflow that captures the images, extracts data from the scanned files, and uses that data to index them for storage and retrieval.
Luckily, users are not on their own in building this process. There are software tools designed to assist and to help automate the input. Knowing where to apply these tools, and which ones will work best for a given process, is one of the areas where a good vendor or consultant adds real value.
While it takes more work to create the automated process, the time savings and potential accuracy gains can easily provide a return on the investment.
What do these tools look like?
The answer is that it varies. Here are some of the most common ways to extract the data.
Bar code data input. You can input document index data by reading bar codes already on the documents, by applying a bar code label, or by using a bar-coded separator sheet. In my experience, properly structured bar code data entry is the most reliable and often the easiest process to deploy, provided you understand how bar codes work.
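To make the separator sheet idea concrete, here is a minimal Python sketch. It assumes the capture software has already read the bar codes and hands you one value (or nothing) per page, and it assumes a made-up convention where separator sheets carry a value starting with "SEP:". The names split_batch, Document, and SEPARATOR_PREFIX are mine for illustration, not from any particular product.

```python
from dataclasses import dataclass, field

# Hypothetical convention: separator sheets carry "SEP:" plus the index value.
SEPARATOR_PREFIX = "SEP:"


@dataclass
class Document:
    index_value: str
    pages: list = field(default_factory=list)


def split_batch(scanned_pages):
    """Group pages into documents using separator-sheet bar codes.

    scanned_pages is a list of (page_name, barcode_text_or_None) tuples
    in scan order. Each separator sheet starts a new document and supplies
    its index value; the separator page itself is not stored.
    """
    documents = []
    current = None
    for page_name, barcode in scanned_pages:
        if barcode is not None and barcode.startswith(SEPARATOR_PREFIX):
            current = Document(index_value=barcode[len(SEPARATOR_PREFIX):])
            documents.append(current)
            continue
        if current is None:
            raise ValueError(f"page {page_name!r} appears before any separator sheet")
        current.pages.append(page_name)
    return documents


batch = [
    ("p1.tif", "SEP:INV-1001"),
    ("p2.tif", None),
    ("p3.tif", None),
    ("p4.tif", "SEP:INV-1002"),
    ("p5.tif", None),
]
for doc in split_batch(batch):
    print(doc.index_value, doc.pages)
```

The point of the sketch is that one reliable bar code read indexes every page that follows it, which is a big part of why this approach tends to be easy to deploy.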
Optical Character Recognition. Often called zonal OCR, this process involves identifying a section or data field, such as an invoice number, that is unique to each document but appears in the same location on every one. Once you identify this location to the software, the information in the zone is extracted, recognized, and converted to text characters, and it then becomes an index field stored with the image. The success of this process depends on many things, including the quality of the image being processed, the font used to create the original data, the OCR software being used, the number of fields being processed, and even the consistency of the scans for size, skew, and background. When it works, zonal OCR can be almost magic. When it doesn't, the cleanup can be a challenge.
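Here is a rough Python sketch of the zone step. It does not run OCR itself. It assumes your OCR engine has already returned recognized words with pixel bounding boxes, and it only shows how a zone picks out the field and how a pattern check catches bad reads. The Word and Zone types, the coordinates, and the "INV-" plus four digits format are all assumptions made up for the example.

```python
import re
from typing import NamedTuple, Optional


class Word(NamedTuple):
    text: str
    left: int
    top: int
    right: int
    bottom: int


class Zone(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical invoice number format used to sanity-check the OCR result.
INVOICE_PATTERN = re.compile(r"^INV-\d{4}$")


def words_in_zone(words, zone):
    """Join the words whose centre point falls inside the zone."""
    hits = []
    for w in words:
        cx = (w.left + w.right) / 2
        cy = (w.top + w.bottom) / 2
        if zone.left <= cx <= zone.right and zone.top <= cy <= zone.bottom:
            hits.append(w)
    hits.sort(key=lambda w: (w.top, w.left))  # simple top-to-bottom, left-to-right order
    return " ".join(w.text for w in hits)


def extract_invoice_number(words, zone) -> Optional[str]:
    """Return the zone text if it looks like an invoice number, else None."""
    value = words_in_zone(words, zone).strip()
    return value if INVOICE_PATTERN.match(value) else None


invoice_zone = Zone(left=475, top=30, right=600, bottom=70)
page = [
    Word("ACME", 40, 40, 100, 60),
    Word("Invoice", 400, 40, 470, 60),
    Word("INV-1001", 480, 40, 560, 60),
]
print(extract_invoice_number(page, invoice_zone))
```

A page that was scanned shifted or skewed puts the number outside the zone, and a poor image can turn a zero into a letter O. In both cases this sketch returns None, which is the kind of result you would route to a person for cleanup.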
Database Lookup. In this process you first capture one piece of identifying data, either manually or with another automated method (see bar code and zonal OCR above). You then match that value against a database and pull the related fields you want in your index. Using the invoice example above, think about pulling the vendor name, PO number, and shipping information that match that invoice once it has been found.
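As a small illustration, the Python sketch below does this lookup against an in-memory SQLite database. The table name, column names, and sample row are invented for the example. In practice you would be querying whatever accounting or ERP system holds your invoice records.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory database with one sample invoice."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "invoice_number TEXT PRIMARY KEY, "
        "vendor_name TEXT, po_number TEXT, ship_to TEXT)"
    )
    conn.execute(
        "INSERT INTO invoices VALUES (?, ?, ?, ?)",
        ("INV-1001", "Acme Supply", "PO-7788", "Dock 4, Springfield"),
    )
    conn.commit()
    return conn


def lookup_index_fields(conn, invoice_number):
    """Use one captured value to fetch the rest of the index fields.

    Returns a dict of index fields, or None when the invoice is not found.
    """
    row = conn.execute(
        "SELECT vendor_name, po_number, ship_to "
        "FROM invoices WHERE invoice_number = ?",
        (invoice_number,),
    ).fetchone()
    if row is None:
        return None
    vendor_name, po_number, ship_to = row
    return {
        "invoice_number": invoice_number,
        "vendor_name": vendor_name,
        "po_number": po_number,
        "ship_to": ship_to,
    }


conn = build_demo_db()
print(lookup_index_fields(conn, "INV-1001"))
print(lookup_index_fields(conn, "INV-9999"))  # not in the database
```

Returning None for an unknown invoice number matters: a failed lookup is often the first sign that the captured value was misread, so it is worth flagging rather than storing the image with empty fields.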
Full Text Search. A popular process that many vendors promote is full text searching of the image file. In this case, all of the text in the scanned document is converted into a text file using OCR software and stored with the image of the document. When you want to find the document, you search for a word or text string you expect it to contain.
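For a feel of what happens behind the search box, here is a toy Python sketch of a word index built from OCR text. It assumes the OCR text for each image is already available as a plain string, and the file names and contents are invented. Real products use far more capable search engines, so treat this only as an outline of the idea.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_text_by_document):
    """Map each word to the set of documents whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in ocr_text_by_document.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = None
    for token in tokens:
        matches = index.get(token, set())
        result = set(matches) if result is None else result & matches
    return result


ocr_text = {
    "scan-001.tif": "Invoice INV-1001 Acme Supply",
    "scan-002.tif": "Invoice INV-1002 Globex",
    "scan-003.tif": "Packing slip Acme Supply",
}
index = build_index(ocr_text)
print(sorted(search(index, "invoice")))       # common word, several hits
print(sorted(search(index, "acme invoice")))  # more specific, fewer hits
```

Even in this tiny example, a common word like "invoice" matches more than one file, while adding a more specific term narrows the result.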
So which is best?
None of these is perfect. All of them have advantages and difficulties.
To use bar codes, you need to have them in the document or be able to associate them with the document somehow.
To use zonal OCR, you have to be able to accurately separate the field data from the surrounding image.
To use database lookup, you have to be able to access the database and find a field to capture for reference.
To use full text search, you need highly variable documents with very specific identifying keywords. Otherwise, your keyword search is going to return thousands of potential matches for your search terms.
The final piece of the puzzle is where to find the software to process your images in this way.
Most image capture software will have some or all of these tools available. The more sophisticated your software, the more refined these tools will be. There are also third-party packages that can be integrated with scanning software for post-scan processing and routing to your document management solution. Some of these packages can build very sophisticated workflows that do much more than just index your images. They can often handle electronically generated files as well as scanned files, making it possible to merge your electronic output and your paper records into a similar process.
Have you been frustrated by manually indexing image files? Is the process you use for document management input automated?
Share your comments below.
Photo credit: Wikimedia Commons Public Domain