Wednesday, July 13, 2011

Correcting OCR using hOCR in Firefox

Quick post on a little tool I came across, moz-hocr-edit. This Firefox add-on lets you proofread Optical Character Recognition (OCR) output. Given my interest in OCR and the Biodiversity Heritage Library I decided to take it for a spin.

moz-hocr-edit uses hOCR, a format for representing the output of OCR software that is used by tools such as OCRopus (you can see the public specification for hOCR here). Basically it's a microformat, that is, it's ordinary HTML with some extra markup conventions: class names such as ocr_page and ocr_line identify the OCR elements, and title attributes carry properties such as bounding boxes. Given some hOCR, moz-hocr-edit enables you to edit the OCR output line-by-line.
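
To make the format a little more concrete, here is a minimal Python sketch that pulls the lines and their bounding boxes out of some hOCR. The sample fragment, the class name HocrLineParser and the helpers parse_bbox and extract_lines are all made up for illustration, and this is not how moz-hocr-edit itself works; it only assumes that each line is an element with class ocr_line whose title attribute contains a bbox property.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment for illustration; real files carry more metadata.
SAMPLE_HOCR = """
<div class="ocr_page" title="bbox 0 0 1000 1500">
  <span class="ocr_line" title="bbox 100 200 900 240">Case 3368</span>
  <span class="ocr_line" title="bbox 100 250 900 290">Eatoniella Dall, 1876</span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # Properties in the title are separated by semicolons, e.g. "bbox 1 2 3 4; ...".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) pairs for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None    # tag name of the line element being read
        self._depth = 0     # nesting depth of that tag name inside the line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._tag is None:
            if "ocr_line" in (attrs.get("class") or "").split():
                self._tag, self._depth = tag, 1
                self._bbox = parse_bbox(attrs.get("title") or "")
                self._text = []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def extract_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE_HOCR):
        print(bbox, text)
```

Because the OCR information lives in class and title attributes, the same file still displays as a normal web page, which is what lets a browser add-on edit it in place.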

Demo
I've created a simple demo based upon Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation. For the demo to work you will need to use the Firefox web browser with the moz-hocr-edit add-on installed.

  1. Go to http://dl.dropbox.com/u/639486/hocr/80780.html
  2. You will see a simple HTML representation of the OCR text from "Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation". I created this HTML from the original ABBYY FineReader XML from the Internet Archive.
  3. At the bottom right-hand corner of the Firefox browser window you should see hOCR. Click on it and select "Edit this hOCR document":
    Statusbar
  4. Firefox will open a new tab that will look something like this:
    Screenshot
  5. You can now edit individual lines of text, and see your edits applied to the HTML below.
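
Step 2 mentions that the HTML was generated from ABBYY FineReader XML. The following is only a rough sketch of what such a conversion could look like, not the script used for this demo. It assumes the XML has page elements with width and height attributes, line elements with l, t, r and b attributes, and one charParams element per character; the sample document and the function names local and abbyy_to_hocr are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified, hypothetical sample; the namespace and layout are assumptions
# modelled on FineReader XML, and real files contain far more detail.
SAMPLE_ABBYY = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
  <page width="1000" height="1500">
    <block blockType="Text">
      <text><par>
        <line l="100" t="200" r="300" b="240">
          <formatting><charParams l="100" t="200" r="120" b="240">H</charParams><charParams l="125" t="200" r="140" b="240">i</charParams></formatting>
        </line>
      </par></text>
    </block>
  </page>
</document>"""


def local(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def abbyy_to_hocr(xml_text):
    """Emit one ocr_page div per page and one ocr_line span per line."""
    root = ET.fromstring(xml_text)
    out = []
    for page in root.iter():
        if local(page.tag) != "page":
            continue
        out.append('<div class="ocr_page" title="bbox 0 0 %s %s">'
                   % (page.get("width", "0"), page.get("height", "0")))
        for line in page.iter():
            if local(line.tag) != "line":
                continue
            # Each charParams element is assumed to hold a single character.
            text = "".join(c.text or "" for c in line.iter()
                           if local(c.tag) == "charParams")
            # ABBYY's left/top/right/bottom map onto hOCR's bbox x0 y0 x1 y1.
            bbox = " ".join(line.get(k, "0") for k in ("l", "t", "r", "b"))
            out.append('  <span class="ocr_line" title="bbox %s">%s</span>'
                       % (bbox, escape(text)))
        out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    print(abbyy_to_hocr(SAMPLE_ABBYY))
```

The sketch keeps only page and line bounding boxes; a fuller conversion might also keep paragraphs, words, or per-character confidence.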
moz-hocr-edit is a neat little tool. With appropriate web server settings (and, as the tool's author Jim Garrison suggests, autoversioning) it could be the basis of a great tool for correcting OCR errors in BHL.