
Using Google Docs Optical Character Recognition

18 December 2010

From 1991 through 1996, I wrote on a Brother word processor that consisted of an electronic typewriter attached to a monitor.  The monitor displayed amber characters and it was great.  Edits could be made on the screen without needing to be corrected on the paper.  My resume and cover letters were stored on one diskette.  I had a database for mail merges on another diskette.  It served my job search well then, but the main reason I had the word processor was to write fiction.  I filled two file cabinet drawers with fiction and submitted one short story.  It was published and later nominated for an Illinois Arts Council Award.  Then life and work interrupted and I let that dream slip away.

The other day I opened the file cabinet and pulled out a story.  It was a 5,000-word short story.  I started to read.  After a couple of paragraphs, I set it down.  I asked myself who wrote this story, even though it had my name and old address on it.  I just couldn’t fathom that I had written something so good.  I read on.  A page later, I realized it was mine.  I remembered the story, its multiple drafts, and the development of the characters.  I also saw it needed one more pass to call it a final draft.

I decided to try Google Docs optical character recognition from my desktop computer, which runs Ubuntu Linux 10.10.  I really did not want to type the whole thing, so I dusted off my Canon CanoScan N676U scanner and scanned 20 pages.  I saved the pages as a PDF document.  I opened it in Evince and saw that everything looked fine.  Then I attempted to upload it to a folder in Google Docs.  It bombed after several minutes of uploading.  The file was too big.  I should have looked at Google Docs’ upload limits first.
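A quick check before uploading would have caught this.  Here is a minimal sketch in Python of that kind of check.  The 10 MB limit is only a placeholder assumption, not Google's actual figure, and the function names are made up for illustration.

```python
import math
import os

# Placeholder limit for illustration only (an assumption, not Google's real
# figure); look up the service's current upload limit before relying on it.
ASSUMED_LIMIT_BYTES = 10 * 1024 * 1024


def sections_needed(size_bytes, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return the minimum number of uploads needed for a file of this size."""
    if size_bytes < 0 or limit_bytes <= 0:
        raise ValueError("size must be >= 0 and limit must be > 0")
    # Even an empty file still takes one upload.
    return max(1, math.ceil(size_bytes / limit_bytes))


def check_before_upload(path, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return (fits, sections) for the file at `path`."""
    size = os.path.getsize(path)
    return size <= limit_bytes, sections_needed(size, limit_bytes)
```

The section count is a lower bound.  Scanned pages vary in size and a PDF can only be split on page boundaries, so the real number of pieces may be higher.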

I opened PDF Chain and split the PDF apart.  I uploaded the document in sections to Google Docs and had it OCR the pages.  From what I’ve read, Google Docs uses the Tesseract engine.  It performed very well.  The occasional lowercase L was recognized as a number one, but not very often.  Oddly, a line of prose was transposed on one of the pages.  But what made it nice was that Google Docs retained the PDF and embedded it in the document, so I could compare it against the OCRed text.  I then copied and pasted the text from each section into a single document.  I downloaded the reconstituted document to my computer and imported it into the Scrivener for Linux beta that I am testing (I will be writing more on Scrivener at a later date).
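The lowercase L read as a number one is the kind of error that is easy to miss when proofreading by eye.  As a rough sketch, and only a heuristic, a few lines of Python can list the words where a digit 1 sits right next to a letter so they can be checked against the scan.  The function name is hypothetical, and this is not something Google Docs or Tesseract provides.

```python
import re

# A digit 1 with a letter directly before or after it, e.g. "wou1d" or "1ight".
SUSPECT_ONE = re.compile(r"(?<=[A-Za-z])1|1(?=[A-Za-z])")


def suspect_words(text):
    """Return the whitespace-separated words that contain a suspicious '1'."""
    return [word for word in re.findall(r"\S+", text)
            if SUSPECT_ONE.search(word)]


sample = "He wou1d call at 10 and wait unti1 11."
print(suspect_words(sample))
```

It only flags candidates for review.  Legitimate text such as "1st" will be flagged too, and a lone "1" standing in for "I" or "l" will slip through.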

The upload limit on Google Docs is a pain.  It created more steps than I wanted.  It meant I had to work on the document in sections and then reconstitute it.  I thought scanning the original document was going to take the most time, but it didn’t.  I spent more time vivisecting the scan’s complete PDF for upload, uploading the sections, and merging them back together again.  This has me reconsidering using Google Docs’ OCR abilities.  I mean, if I were to scan the 250-page novel that I also have in my file cabinet, it would take more time than that story is worth.  Yes, that novel is that bad.  Even another short story would be a pain to OCR via Google Docs.  I can use OCRFeeder with the Tesseract engine on my desktop and get the same quality results as Google Docs OCR offers without creating additional steps.
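OCRFeeder is a graphical front end, so there is nothing to script there.  For a batch of pages, a different route is to call the Tesseract command-line program directly.  The sketch below assumes that `tesseract` is installed and on the PATH, that each page was saved as an image file such as a TIFF rather than inside a PDF, and that Tesseract writes its output to the given base name plus `.txt`.  The function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    image = Path(image_path)
    out_base = image.with_suffix("")  # page-01.tif -> page-01
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(image_paths, lang="eng"):
    """Run Tesseract on each page image and return the combined text."""
    parts = []
    for image_path in image_paths:
        cmd = tesseract_command(image_path, lang)
        # check=True raises CalledProcessError if Tesseract fails on a page.
        subprocess.run(cmd, check=True)
        # Assumption: the recognized text lands in "<output base>.txt".
        parts.append(Path(cmd[2] + ".txt").read_text(encoding="utf-8"))
    return "\n".join(parts)
```

Calling `ocr_pages(["page-01.tif", "page-02.tif"])` would return the text of both pages as one string, which removes the split, upload, and merge steps entirely.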

I’m all about saving steps as I digitize many of my short stories and writing journals.  I’m ready to renew my dreams.  One short story at a time.

2 Comments
  1. Natalie
    20 December 2010 6:35 pm

    It’s nice, at least, to see this technology starting to become more reliable. Perhaps in the future there will be further steps and more accuracy! In the meantime, however, I wanted to mention one of my favorite free online OCR software programs. It’s offered by Ricoh Innovations and anyone can try it out at: http://beta.rii.ricoh.com/betalabs/content/document-conversion

    • richardfcrawley
      20 December 2010 8:22 pm

      Thank you for commenting, Natalie. The Ricoh site has a 20 MB upload limit. That’s better than Google’s limit at the moment.

      I configured OCRFeeder to use the Tesseract engine and now do the OCR process in Ubuntu with the same results as the Google Docs OCR option.
