I purchased a Doxie Go scanner a couple of years ago and it has been great for creating digital copies of my documents. However, I had a need to OCR PDF files I'd previously scanned that were not OCR'd at the time. Doxie software is designed to only process images scanned with a Doxie device, so at first it seemed I could not use this as my solution. After a little digging around, I found a way to leverage the Doxie software to do it for me.
- Doxie Scanner
- Doxie Software - Used for OCR
- ImageMagick - Used to convert non-OCR'd PDF file to JPG
- Exiv2 - Used to modify Exif image metadata to add the camera model "Doxie Go"
NOTE: I used a Windows 7 machine to perform this work.
Stage 1 - Convert PDF to JPG
Doxie scans documents into a JPG file, which is then imported from the scanner into the Doxie software. The data is stored in the file system in a couple of places:
This holds the actual JPG file imported from the Doxie scanner.
SQLite database used to catalog Doxie scans that have been imported along with any edits that are performed in the software. Use DB Browser for SQLite to view.
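If you would rather peek at the catalog from a script than from DB Browser for SQLite, a minimal Python sketch along these lines could list the tables it contains. The `list_tables` helper and its `db_path` argument are hypothetical; point it at wherever the Doxie database lives on your machine. It opens the file read-only and makes no assumptions about the Doxie schema.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro keeps the script from modifying the catalog by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Working on a copy of the database, with the Doxie software closed, is the safer way to experiment.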
To leverage the Doxie software, I first had to convert my PDF files to JPG files. ImageMagick does this with ease using the following command at the CMD prompt:
> convert -density 600 C:\temp\test.pdf C:\temp\test.jpg
Appending a zero-based page index in square brackets after the PDF file name (ImageMagick's file.pdf[N] syntax) will convert a specific page. If omitted, all pages will be converted into their own JPG file.
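For a whole folder of PDFs, the same command can be scripted. The following is a minimal Python sketch, not the method used above: the helper names `build_convert_command` and `convert_folder` are illustrative, and it assumes ImageMagick is installed. Windows ships its own unrelated convert.exe, so you may need to pass the full path to ImageMagick's executable through the `executable` parameter.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, jpg_path, density=600, page=None,
                          executable="convert"):
    """Return the ImageMagick argument list for one PDF (illustrative helper)."""
    source = str(pdf_path)
    if page is not None:
        # ImageMagick selects one page with a zero-based index: file.pdf[0]
        source += f"[{page}]"
    return [executable, "-density", str(density), source, str(jpg_path)]


def convert_folder(folder, density=600, executable="convert"):
    """Run the convert command on every .pdf file found in folder."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = build_convert_command(
            pdf, pdf.with_suffix(".jpg"), density, executable=executable
        )
        # check=True raises CalledProcessError if ImageMagick reports a failure.
        subprocess.run(cmd, check=True)
```

Calling `convert_folder(r"C:\temp")` would then process each PDF in that folder using the same 600 density setting as the command above.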
Stage 2 - Modify Exif Image Metadata
Once I had the JPG files, I needed to add a camera model of Doxie Go to the image metadata so that the Doxie software would recognize the files as coming from my Doxie Go scanner. Otherwise, it will not import the files into the software. For this, I used Exiv2 on the CMD:
> exiv2 -M "add Exif.Image.Model Doxie Go" c:\temp\test.jpg
To verify the Exif data, run the following:
> exiv2 c:\temp\test.jpg
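With many JPG files, tagging and checking each one by hand gets tedious. Here is a rough Python sketch that wraps the two exiv2 commands above. The helper names are illustrative, exiv2 is assumed to be on the PATH, and the check assumes the summary output contains a line mentioning "model" alongside the value, so adjust it if your exiv2 version prints something different.

```python
import subprocess

CAMERA_MODEL = "Doxie Go"


def build_tag_command(jpg_path, model=CAMERA_MODEL):
    """Return the exiv2 argument list that adds the camera model tag."""
    return ["exiv2", "-M", f"add Exif.Image.Model {model}", str(jpg_path)]


def summary_has_model(summary_text, model=CAMERA_MODEL):
    """Check exiv2's printed summary for the expected camera model.

    Assumption: the summary has a line containing "model" and the value.
    """
    return any(
        "model" in line.lower() and model in line
        for line in summary_text.splitlines()
    )


def tag_and_verify(jpg_path):
    """Tag one JPG, then re-read it with exiv2 to confirm the model is set."""
    subprocess.run(build_tag_command(jpg_path), check=True)
    result = subprocess.run(
        ["exiv2", str(jpg_path)], capture_output=True, text=True
    )
    return summary_has_model(result.stdout)
```

`tag_and_verify(r"c:\temp\test.jpg")` would return True once the model shows up in the summary.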
Stage 3 - Import Into Doxie Software
- Copy the JPG files onto the Doxie scanner's file system under
The H: drive is where the Doxie Go scanner is mounted on my machine.
- Open the Doxie software and import the files.
- Make any necessary changes to the scanned images and save the files using one of the OCR options. That's it!
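The copy in the first step can also be scripted. This is only a sketch: `copy_jpgs` is a made-up helper, and `scanner_dir` must be the image folder on your own scanner's file system, which the script does not try to guess.

```python
import shutil
from pathlib import Path


def copy_jpgs(source_dir, scanner_dir):
    """Copy every .jpg from source_dir into scanner_dir; return copied names."""
    target = Path(scanner_dir)
    if not target.is_dir():
        # Fail early if the scanner is not mounted or the folder is wrong.
        raise FileNotFoundError(f"Scanner folder not found: {target}")
    copied = []
    for jpg in sorted(Path(source_dir).glob("*.jpg")):
        shutil.copy2(jpg, target / jpg.name)
        copied.append(jpg.name)
    return copied
```

The returned list makes it easy to confirm which files were copied before opening the Doxie software to import them.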
I hope you've found this information helpful in applying OCR to some of your older scanned documents. I'm planning to explore whether the Doxie scanner is needed in this process, so you wouldn't even need to own the device to utilize their software. If I find a way to do this, I'll update this post.