Powerful Image to PDF Conversion with OCR for free with WSL

I decided to move as much as possible to the cloud, and I'm quite happy with the decision. The OneDrive mobile app in particular helps with scanning paper documents. It's free for up to 10 pages and includes OCR.

But what about the tons of existing images that are actually scans of documents? Sure, converting images to PDF is easy, and there are plenty of tools for it (usually already included in your favorite OS or browser as a virtual PDF printer). So what's the catch? Well, wrapping an image in a PDF does not use the real strength of the PDF format, which is storing text. And without text, no modern OS can index your files to make your digital archive searchable.

My requirements were relatively simple:
  1. Run OCR on existing PDF files and save the text in the existing PDF (or replace it)
  2. Convert a single image file to a PDF with embedded text from OCR
  3. Convert and merge multiple image files into one PDF with embedded text from OCR
  4. Running the above use cases should be as simple as possible (only a few clicks), or ideally even possible from the command line. 💪

There are a couple of tools for Windows that offer OCR functionality, mostly bundled with other PDF-related functionality like editing, merging, etc. Most of them are either buggy, feature-overloaded, too expensive or simply not usable. My suggestion: stay away from tools like OmniPage (too expensive for OCR only), SimpleOCR (seems outdated), FreeOCR (outdated) and WonderShare PDFelement (yearly subscription). It's actually quite interesting how polluted the Windows ecosystem has become with half-baked and overpriced PDF tools...

Think outside the Windows Box

Luckily, since Microsoft integrated Linux support into Windows (Windows Subsystem for Linux), Windows 10 can run thousands of applications that were previously only available for Linux.
Most of them run on WSL these days, and one of them is a very powerful tool called OCRmyPDF. It combines a couple of open source tools to analyze PDFs or images with Tesseract and save the result as a PDF.

That's exactly what we need. Let's have a look at how the following workloads are supported:
  • Add OCR to existing PDF
    This is by far the most common use case. OCRmyPDF supports multiple languages and uses Tesseract under the hood.
  • Convert single or multiple images to PDF
    While OCRmyPDF can convert single images to PDF, the suggested approach is to use an additional included tool called img2pdf and pipe its result to OCRmyPDF. img2pdf supports adding multiple files to one PDF and runs on the console as well.

  • Merge Multiple PDFs
    This can be achieved with another tool called pdftk. It supports various merge and split operations on PDFs. Alternatively, you can use PDFSam for Windows, which does the same but comes with a user interface.

  • Deskew PDF
    What is more annoying than a PDF whose pages are crooked or badly oriented? OCRmyPDF includes an algorithm to fix skewed pages.
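
As a rough sketch, the four workloads above map to commands like the following once the tools are installed inside Ubuntu. The function names and file names are mine and purely illustrative, and I am assuming pdftk has been installed separately (it is not part of OCRmyPDF).

```shell
# Add a text layer to an existing PDF.
# $1: Tesseract language(s), e.g. "eng" or "deu+eng"; $2: input PDF; $3: output PDF
ocr_pdf() {
  ocrmypdf -l "$1" "$2" "$3"
}

# Combine several images into one searchable PDF.
# img2pdf writes the PDF to stdout; "-" tells ocrmypdf to read it from stdin.
# $1: output PDF; remaining arguments: image files in page order
images_to_pdf() {
  out=$1
  shift
  img2pdf "$@" | ocrmypdf - "$out"
}

# Merge several PDFs into one with pdftk.
# $1: output PDF; remaining arguments: input PDFs in order
merge_pdfs() {
  out=$1
  shift
  pdftk "$@" cat output "$out"
}

# Straighten crooked scans while running OCR.
# $1: input PDF; $2: output PDF
deskew_pdf() {
  ocrmypdf --deskew "$1" "$2"
}

# Example calls (hypothetical file names):
#   ocr_pdf deu+eng scan.pdf scan_ocr.pdf
#   images_to_pdf letter.pdf page1.jpg page2.jpg
#   merge_pdfs all.pdf january.pdf february.pdf
#   deskew_pdf crooked.pdf straight.pdf
```

Languages other than English usually need the matching Tesseract language pack to be installed first.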

Installation

Thanks to WSL (Windows Subsystem for Linux) it's super easy to run the application under Windows (well, technically Linux, but anyway). It may sound a bit complex, but bear with me, it's super simple.

1. Install WSL for your Windows 10 installation

The commands below have to be executed in an elevated command prompt. If you already have WSL installed and want to upgrade to WSL2, please follow the instructions on https://docs.microsoft.com/en-us/windows/wsl/install-win10

Install WSL (Version 1). Do not restart yet after command has run!
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

Enable the Virtual Machine Platform feature (required for the upgrade to WSL2)
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Restart is required. Do it manually.

2. Install Ubuntu

No, no MBR hacking, no Grub, and no dual boot is required anymore. Just navigate to the Microsoft Store and search for Ubuntu. After installing, it will be shown as an app in your start menu.
 
After starting Ubuntu for the first time, you will be asked to set a username and password for the installation. Also make sure that you install the latest updates to the system by executing the following command
sudo apt update && sudo apt upgrade

3. Install OCRmyPDF

We'll install OCRmyPDF using the same tools that we've used to update the system.
sudo apt install ocrmypdf

Some distributions come with outdated versions of OCRmyPDF. In that case, you can install a newer version via pip instead. This process is explained in the online documentation.
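
Before deciding between the apt package and pip, it can help to check what is already installed. The following is a small sketch with helper names of my own choosing; the pip command in the comment assumes pip3 is available and that the dependencies listed in the online documentation are in place.

```shell
# Return success if a command is available on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print one line per tool that the workflow in this post relies on.
report_ocr_tools() {
  for tool in ocrmypdf tesseract img2pdf; do
    if have_cmd "$tool"; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

# Typical follow-up commands, to be run manually:
#   ocrmypdf --version                       # which version did apt install?
#   tesseract --list-langs                   # which OCR languages are available?
#   pip3 install --user --upgrade ocrmypdf   # newer version via pip
```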

4. First Use

The simplest way to use OCRmyPDF is through the console while running Ubuntu. For example, let's assume your file Letter.jpg is stored in the Documents folder. The command to convert it to a PDF (written to the current directory) would be
ocrmypdf /mnt/c/Users/mis/Documents/Letter.jpg Letter.pdf

The above command runs inside WSL (Ubuntu), but we can also run any Ubuntu command from a Windows command line just by prefixing it with wsl. The following does basically the same, but we don't have to 'start' Ubuntu first.
wsl ocrmypdf /mnt/c/Users/mis/Documents/Letter.jpg Letter.pdf

Screenshot: Windows runs Linux commands on a standard command line prompt
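
Once the single-file conversion works, a whole folder of scans can be handled with a short loop. This is a minimal sketch, not part of the original setup: the function name is hypothetical, and the 300 DPI default is my assumption for scans whose files carry no usable resolution information (--image-dpi overrides whatever the file states).

```shell
# Convert every .jpg in a directory into a searchable PDF stored next to it.
# $1: directory, e.g. /mnt/c/Users/mis/Documents
# $2: DPI to assume for the images (optional, defaults to 300)
ocr_all_jpgs() {
  dir=$1
  dpi=${2:-300}
  for img in "$dir"/*.jpg; do
    [ -e "$img" ] || continue   # the glob matched nothing
    ocrmypdf --image-dpi "$dpi" "$img" "${img%.jpg}.pdf"
  done
}

# Example call:
#   ocr_all_jpgs /mnt/c/Users/mis/Documents
```

Each Letter.jpg ends up with a Letter.pdf beside it, and the original images are left untouched.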


Context Menu for OCRmyPDF
Well, that works quite well, but it's not really handy to type in these commands all the time. If you work with the Explorer a lot for organizing files, there is a trick: a simple adjustment to the system adds an additional context menu item for supported file types like jpg, png, tiff, pdf, etc.

To add the menu item, you have to add a couple of registry items or you can just import the following snippet.

You can also download the snippet from my GitHub Page. Please note that " has been escaped as \". You can also make the necessary changes to the registry manually, just make sure that you copy only the values and un-escape them first.

Configuration to enable drop down menu item

The first entry describes a new context menu item to be shown for the file types that match the AppliesTo constraint. It will be shown under the name "Convert to PDF/Run OCR" and even has an icon ("shell32.dll,68").

The second entry is a little bit cryptic and deserves some explanation.
  • The parameter %1 is the actual path of the file on which the context menu item was clicked.
  • The part $(wslpath "%1") translates the parameter from a Windows path to a path that WSL can access. It makes use of the quite powerful concept of Command Substitution.
  • The second part has an additional | cut -d'.' -f1).pdf. This removes the file extension and replaces it with pdf.
  • Finally, we pass the argument --force-ocr to the application so that PDFs that already contain text are processed as well.
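
To make the pieces above concrete, here is a rough shell sketch of what such a command does once it runs inside WSL. It is a reconstruction from the description, not the actual registry entry, and the function names are mine. One thing worth knowing: cut -d'.' -f1 keeps everything before the first dot, so a file called invoice.2020.jpg becomes invoice.pdf. The second helper shows an alternative (my suggestion, not from the original snippet) that strips only the last extension.

```shell
# Output name as described above: keep everything before the first '.'
# and append ".pdf".
pdf_name_cut() {
  printf '%s.pdf\n' "$(printf '%s\n' "$1" | cut -d'.' -f1)"
}

# Alternative (assumption, not the original approach): remove only the
# last extension using shell parameter expansion.
pdf_name_strip() {
  printf '%s.pdf\n' "${1%.*}"
}

# Hypothetical wrapper: translate a Windows path with wslpath, then run
# OCRmyPDF with --force-ocr and write the PDF next to the source file.
# $1: Windows path such as C:\Users\mis\Documents\Letter.jpg
ocr_from_windows_path() {
  src=$(wslpath "$1") || return 1
  ocrmypdf --force-ocr "$src" "$(pdf_name_cut "$src")"
}

# Example call inside WSL:
#   ocr_from_windows_path 'C:\Users\mis\Documents\Letter.jpg'
```

Note that wslpath only exists inside WSL, so the wrapper will not work on a regular Linux installation.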

Bonus: Additional Tools that I find handy

PDFSAM Basic (https://pdfsam.org/)


PDFSam is a small utility that helps with merging and occasionally re-ordering pages in a PDF. Nothing more and nothing less. It just works out of the box and is free. It wants you to buy the professional version, but you can simply hide the additional premium features from the UI. Great stuff.

SumatraPDF (https://www.sumatrapdfreader.org/free-pdf-reader.html)

SumatraPDF won't win any award for the prettiest user interface, and its installer will make you think twice before installing. But the minimal UI has its benefits. This PDF reader is super fast and does just one thing right: displaying PDFs! Slim and fast. Don't forget to enable the Windows Explorer preview handler, so that PDFs will be previewed in Windows Explorer.

Other Command Line Tools

The following command line tools can make handling of your PDF/Images even simpler, as they run entirely on console.
