
Working for The Man

I recently wrote a paper for an Information Technologies class on OCR, or Optical Character Recognition: software that allows a computer to “read” text. It works fine for things printed in the past fifty or so years, but is pretty useless when it comes to older stuff. Yellowed pages, faded text and old typefaces still confound the technology. Enter reCAPTCHA, which uses crowdsourcing to convert the text in these documents to digital (searchable, cut/copy/pastable, etc.) text. Everyone has encountered CAPTCHAs, the tests that ticketing websites and the like give us to prove we’re not spambots. Many CAPTCHAs use randomly generated jumbles of letters and numbers as challenges, but reCAPTCHA uses words from old books and newspapers that OCR can’t identify. More specifically, it pairs one word that has already been identified with one that hasn’t. If you type the known word correctly, the program assumes you’re also right about the unknown one. It waits until a word has been keyed in the same way by at least three people, at which point it considers the word identified.
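For the curious, the voting scheme described above can be sketched in a few lines of Python. This is only an illustration of that description, not reCAPTCHA's actual code: the `WordPool` class, its method and parameter names, and the choice to ignore case and surrounding spaces when comparing answers are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# From the description above: a word counts as identified once
# three people have keyed it in the same way.
AGREEMENT_THRESHOLD = 3


class WordPool:
    """Hypothetical tracker for words that OCR could not read."""

    def __init__(self):
        # unknown-word id -> tally of the transcriptions people typed
        self.votes = defaultdict(Counter)
        # unknown-word id -> transcription that reached the threshold
        self.identified = {}

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Check one challenge; return True if the user passes."""
        # Assumption: answers are compared ignoring case and outer spaces.
        if control_typed.strip().lower() != control_answer.strip().lower():
            # The known word was wrong, so the other guess is not trusted.
            return False

        if unknown_id not in self.identified:
            guess = unknown_typed.strip().lower()
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= AGREEMENT_THRESHOLD:
                self.identified[unknown_id] = guess
        return True
```

In this sketch, someone who flubs the known word fails the challenge and their guess at the mystery word is thrown away, while three matching guesses from people who got the known word right settle the matter.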

Pretty cool, right? Crowdsourcing works! We are preserving information and making it accessible! These ubiquitous online challenges, which are merely irritating when you get them right, and infuriating when you don’t (I am not a robot, goddamn it!!), are actually serving the greater good!

Or are they? reCAPTCHA is the brainchild of Luis von Ahn, a Carnegie Mellon professor, but since 2009 the program has been owned and controlled by Google. The words that we identify are slowly but surely contributing to the digitization of the archives of the New York Times and the Google Books project. Helping out the evil empire that is Google always made me slightly uneasy, but how am I supposed to feel about it now that they are in the business of selling e-books? As far as I can tell, the books they’re selling in their new eBookstore are not the same texts that reCAPTCHA is helping to digitize. But this is not a voluntary program, the way Wikipedia is. We are basically forced to take part in it if we want to continue our day-to-day business online, and it is serving a for-profit entity. Frankly, it feels a little sinister to me.

When I first found out about reCAPTCHA, I was surprised that Google wasn’t making more of an effort to publicize the project. Wouldn’t they want people to know that their time and effort weren’t being wasted every time they had to enter a string of letters into a textbox? Now, though, I understand why they’re not shouting about it from the rooftops. They’ve essentially turned everyone with an internet connection into an unpaid laborer without them even knowing it.

There is one thing to come out of the reCAPTCHA project that I have only good feelings about: CAPTCHArt. This is a website of comics that people have created based on the challenges. It is random, childish, often inappropriate, and delightful. See below.

 

image

 
