Paper to PDF workflow with OCR on Linux

For a long time I had dreamed of getting rid of the mess of papers on my desk. My goal was to scan the documents to PDF, print a numbered label and save the OCR’ed document somewhere in the cloud. It was important to me to produce fully searchable PDF files, not just images.
As this was not trivial to do, I would like to share my little solution for the benefit of others who want to do the same.

Archival system

I thought for quite some time about the optimal way of storing and retrieving the items. My old system with different folders for different document categories (like banking, personal, invoices, …) was not easy enough to use: I was just too lazy to archive my documents, and the desk always became crowded with them.
My optimal technique is simple. I have two categories for paper documents: very important and less important. Each category has its own number group. There are no other categories for the paper documents; everything else is categorized on the PC.
This makes it quick to assign the category to a document, scan it, attach the label and put it away in the folder. The desk is clean again and the real categorization can be done later on the PC.
I hope that I can throw away the less important documents after a few years – this idea is the reason for the two categories. I could also call them “keep forever” and “throw away after some time”.

Hardware

HP ProLiant G7 MicroServer N54L
My cheap and small Linux server. Used as NAS and backup and also for my playtime projects.

[Photo: the server with the label printer and backup drive]

Brother P-touch QL570
The label printer used to print the numbered labels I stick on the processed documents.

Brother ADS-1600W
A document scanner that was capable of scanning to PDF and saving the file to a server on its own (most devices in the consumer price range need some software installed on the computer and the computer must be running for them to scan to PDF). Has duplex and working Wi-Fi. The price was around 300 Euros.

The scanner feed can be folded away, which leaves the scanner quite small on the desk.

Software

Ubuntu 14.04.2 LTS
I always liked Debian Linux, but I was missing a stable release schedule. For this reason, I switched to Ubuntu some years ago and still don’t regret the step.

incron
I did not know this software before the project: A kind of cron for file system events. Executes scripts when a given trigger occurs (create, move, …). Very handy.

Tesseract
The OCR component of the solution. A little complicated to use, but very powerful. I first thought about buying commercial OCR software, but found the results with Tesseract to be sufficient for my purposes.

pypdfocr
Tesseract is picky about its input images and not easy to handle either. pypdfocr takes care of the low-level work: it extracts the images from the PDF, increases the contrast, calls Tesseract and combines everything into a searchable PDF.

ql570
I had no success installing the Linux drivers for the label printer in CUPS. This project talks directly to the printer’s USB interface with no other dependencies. It takes an image, prints it and cuts the label.

ImageMagick 6
Used to create the label image file.

Putting it all together

Once all the components were working on their own, all I needed to do was combine them into a working system. It took me a few hours until every detail worked, but the system has been rock solid for a few months now.

Set up the file server
The idea was simple: as I wanted different categories for my documents (each with its own counter), I created one directory per category on the Samba file server.

For instance (in /etc/samba/smb.conf)

[longterm]
 browseable = yes
 read only = no
 path = /data/longterm
 create mask = 0777
 directory mask = 0777

Install the software
Most of the software I used is available in Ubuntu’s repository, but not everything. I hope I recall this correctly, as the installation was some months ago. I installed the German language package for Tesseract; replace it with the package for your own language. The poppler-utils package contains pdfinfo, which I use to extract the total number of pages from the PDF document.

apt-get install incron imagemagick tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu poppler-utils

Install pypdfocr according to the instructions on the website: http://virantha.com/2013/07/22/pyocr-a-python-script-for-running-free-ocr-on-your-pdfs/

Install ql570 following their instructions: https://github.com/sudomesh/ql570 – I was lazy and built it in /root/ql570.
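After installing, it can help to confirm that every command the workflow relies on is actually reachable. The following is only a sketch: the function names are made up for illustration, and it assumes that all tools, including pypdfocr, are on the PATH. The ql570 binary is not covered because it is called by its full path.

```shell
#!/bin/bash
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Commands used by the workflow in this article (assumption: all on PATH).
required="incrontab convert tesseract pdfinfo pypdfocr"

# Report missing tools on stderr, or confirm that everything was found.
check_scan_tools() {
    local missing
    # $required is left unquoted on purpose so it splits into single words.
    missing=$(missing_tools $required)
    if [ -n "$missing" ]; then
        # Unquoted on purpose: puts all missing names on one line.
        echo "Missing:" $missing >&2
        return 1
    fi
    echo "All tools found"
}
```

Calling check_scan_tools after the installation either lists what is still missing or confirms that everything is in place.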

Set up the scanner and the label printer
Not much to do here. Connect the scanner to your Wi-Fi network and browse to its embedded web server. Add the shares (or different directories within one share) as scan destinations.
For the label printer, you need to take note of the USB device that Linux uses for it.

tail -f /var/log/syslog

And turn on the device – it should be clear from the messages which device is used by the label printer (/dev/usb/lp0 in my case).
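As an alternative to reading the log, the device nodes can be listed once the printer is switched on. This is a minimal sketch, assuming the kernel exposes USB printers as /dev/usb/lpN (as in my case); the function name and the optional directory argument are made up for illustration.

```shell
#!/bin/bash
# List USB printer device nodes below the given directory (default /dev/usb).
# The directory argument exists only so the function can be tried elsewhere.
list_usb_printers() {
    local dir="${1:-/dev/usb}"
    local dev
    for dev in "$dir"/lp*; do
        # Without a match the glob stays literal, so test for existence.
        if [ -e "$dev" ]; then
            echo "$dev"
        fi
    done
    return 0
}
```

If list_usb_printers prints more than one device, turning the label printer off and on again shows which entry belongs to it.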

An example file is provided with the ql570 files. You can print this with the following command:

./ql570 /dev/usb/lp0 w example.png

Add the shell script
I wrote a small script that handles everything from PDF to label. Save it somewhere (I always use /root/bin for this purpose). It was written once and never polished, so do not expect nice code.

Create a work-in-progress directory:

mkdir /data/processing

And the output directory:

mkdir -p /data/ready/Longterm
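With more than one category, the directories can be created in a loop. The sketch below assumes the layout used in this article: a lower-case share directory per category, one shared processing directory and one output directory per category below ready. The function name and the category name Shortterm are hypothetical.

```shell
#!/bin/bash
# Create the share, processing and output directories for each category.
# $1 is the base directory (/data in this article), the rest are categories.
make_scan_dirs() {
    local base="$1"
    shift
    local category share
    mkdir -p "$base/processing"
    for category in "$@"; do
        # The Samba share directory uses the lower-case category name.
        share=$(printf '%s' "$category" | tr '[:upper:]' '[:lower:]')
        mkdir -p "$base/$share" "$base/ready/$category"
    done
}
```

For example, make_scan_dirs /data Longterm Shortterm would create /data/longterm, /data/shortterm, /data/processing and the two directories below /data/ready.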

Save this as /root/bin/scanfile.sh and chmod +x /root/bin/scanfile.sh afterwards:

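#!/bin/bash
# Usage: scanfile.sh <watched directory> <file name> <category>
# incron passes the directory as $@ and the file name as $#.

scanpath=/data
file="$1/$2"
user="$3"

# Strip the directory and the extension from the incoming file name.
filename="${file##*/}"
filename="${filename%.*}"

# Move the scan into the work-in-progress directory and run the OCR there.
procfile="$scanpath/processing/$filename.pdf"
mv "$file" "$procfile"
/usr/bin/python /usr/local/bin/pypdfocr -l deu "$procfile" -c /root/bin/pypdf.yaml

procfilename="${procfile##*/}"
procfilename="${procfilename%.*}"

# Read the per-category counter, increment it and store it again.
counterfile="/root/bin/scancount-$user.txt"
if [ -e "$counterfile" ]; then
        count=$(cat "$counterfile")
else
        count=0
fi
((count++))
echo "$count" > "$counterfile"

rm "$procfile"

# The OCR result is expected next to the input as <name>_ocr.pdf.
formattedcount=$(/usr/bin/printf %05d "$count")
ocrfile="$scanpath/processing/${procfilename}_ocr.pdf"
outfile="$scanpath/ready/$user/$formattedcount-scan.pdf"
mv "$ocrfile" "$outfile"

pagecount=$(/usr/bin/pdfinfo "$outfile" | /bin/grep Pages | /usr/bin/awk '{print $2}')

# Braces are needed here: "$user_" would be read as a variable named "user_".
imgfile="/tmp/countimage_${user}_${count}.png"
date=$(/bin/date +"%F %R")

# Draw the label: date, category and page count on top, the number below.
/usr/bin/convert -size 700x100 xc:white \
        -gravity NorthWest -pointsize 20 -font Arial-Regular -annotate 0 "$date" \
        -gravity North -pointsize 20 -font Arial-Regular -annotate 0 "$user" \
        -gravity NorthEast -pointsize 20 -font Arial-Regular -annotate 0 "$pagecount pages" \
        -gravity South -pointsize 40 -font Arial-Bold -annotate 0 "$formattedcount" \
        -rotate 90 -monochrome "$imgfile"

# Print the label and cut it.
/root/ql570/ql570 /dev/usb/lp0 w "$imgfile"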
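The numbering is the heart of the script, so it is worth trying on its own without a scanner attached. The following sketch isolates the counter logic in a function; the function name is made up for illustration, and the counter file can be any path.

```shell
#!/bin/bash
# Increment the counter stored in the given file and print it zero-padded
# to five digits. A missing file is treated as a counter of 0.
next_count() {
    local counterfile="$1"
    local count=0
    if [ -e "$counterfile" ]; then
        count=$(cat "$counterfile")
    fi
    count=$((count + 1))
    # The file keeps the plain number; padding is applied only on output.
    echo "$count" > "$counterfile"
    printf '%05d\n' "$count"
}
```

Storing the plain number and padding it only when printing avoids leading zeros in the file, which shell arithmetic would otherwise interpret as octal.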

Configure incron

Become root (I know this is bad practice, you are free to proceed differently):

sudo su -
incrontab -e

Add the following line:

/data/longterm IN_CREATE /root/bin/scanfile.sh $@ $# Longterm
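In this line, incron replaces $@ with the watched directory and $# with the name of the file that triggered the event; together with the category name they arrive in the script as $1, $2 and $3. With several categories, one line per category is needed. The sketch below generates such lines, assuming the share directory is the lower-case category name below /data; the function name and the category name Shortterm are hypothetical.

```shell
#!/bin/bash
# Print the incrontab line for one category, e.g. "Longterm".
incron_line() {
    local category="$1"
    local share
    share=$(printf '%s' "$category" | tr '[:upper:]' '[:lower:]')
    # Single quotes keep $@ and $# literal; incron expands them, not the shell.
    printf '/data/%s IN_CREATE /root/bin/scanfile.sh $@ $# %s\n' "$share" "$category"
}

# Print one line per category given as argument.
incron_lines() {
    local category
    for category in "$@"; do
        incron_line "$category"
    done
}
```

The output of incron_lines Longterm Shortterm can be pasted into the editor opened by incrontab -e.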

Be happy
From time to time I pick up the documents from the ready directory and sort them into folders in the cloud. I mostly use OneDrive’s full text search to find the documents again, but I also have a folder structure with categories for long term documents and years for short term documents. Additionally, I usually rename the documents to something like this:

142-Hotel-Someplace-March-2015.pdf
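I do the renaming by hand, but it could be scripted. The following is only a sketch of one way to build such a name from the file written by the scan script and a free-text description; the function name is made up for illustration.

```shell
#!/bin/bash
# Build an archive name such as 142-Hotel-Someplace-March-2015.pdf from a
# scanned file (e.g. 00142-scan.pdf) and a description with spaces.
archive_name() {
    local scanfile="$1"
    local description="$2"
    local number
    # Drop the directory, the "-scan.pdf" suffix and the leading zeros.
    number=$(basename "$scanfile" | sed -e 's/-scan\.pdf$//' -e 's/^0*//')
    # Replace spaces in the description with hyphens.
    printf '%s-%s.pdf\n' "$number" "$(printf '%s' "$description" | tr ' ' '-')"
}
```

The result can be passed to mv together with the target folder, for example inside a small loop over the ready directory.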

[Photo: my folders for long and short term documents]

[Photo: a printed label]

That – combined with the full text search – is good enough for me at the moment.

Good luck trying this out. Comments and questions are welcome.