Shaun Friedle created an impressive piece of JavaScript that automatically defeats the CAPTCHAs used by the Megaupload file hosting service. Their CAPTCHAs are particularly weak, but the script still breaks into new territory: JavaScript-based optical character recognition. John Resig posted a breakdown of how the software works. Here’s the quick summary:

  1. The HTML 5 Canvas getImageData API is used to get at the pixel data from the Captcha image. Canvas gives you the ability to embed an image into a canvas (from which you can later extract the pixel data back out again).
  2. The script includes an implementation of a neural network, written in pure JavaScript.
  3. The pixel data, extracted from the image using Canvas, is fed into the neural network in an attempt to divine the exact characters being used – in a sort of crude form of Optical Character Recognition (OCR).
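
The three steps above can be sketched roughly as follows. This is not Shaun's code: the function names (`pixelsFromImage`, `binarize`, `classify`), the brightness threshold, and the two-layer network layout are assumptions made for illustration, and the weights would have to come from training, which is not shown. Only `drawImage` and `getImageData` are actual Canvas calls.

```javascript
// Step 1 (browser only): draw an image onto a canvas and read back
// its raw pixels as a flat RGBA byte array.
function pixelsFromImage(img) {
  const canvas = document.createElement("canvas");
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height).data;
}

// Reduce RGBA bytes to one value per pixel: 1 for dark, 0 for light.
// The threshold of 128 is an assumed default, not a value from the script.
function binarize(rgba, threshold = 128) {
  const out = [];
  for (let i = 0; i < rgba.length; i += 4) {
    const gray = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    out.push(gray < threshold ? 1 : 0);
  }
  return out;
}

// Step 2: a tiny feed-forward network in plain JavaScript.
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// One fully connected layer; weights[j] is the weight row for output j.
function layer(inputs, weights, biases) {
  return weights.map((row, j) =>
    sigmoid(row.reduce((sum, w, i) => sum + w * inputs[i], biases[j]))
  );
}

// Step 3: feed pixel values through the network and return the label
// whose output neuron fires most strongly.
function classify(inputs, net, labels) {
  const hidden = layer(inputs, net.hiddenWeights, net.hiddenBiases);
  const output = layer(hidden, net.outputWeights, net.outputBiases);
  let best = 0;
  for (let k = 1; k < output.length; k++) {
    if (output[k] > output[best]) best = k;
  }
  return labels[best];
}
```

In a page, the call chain would look like `classify(binarize(pixelsFromImage(img)), net, labels)`, with one output neuron per candidate character. Note that `getImageData` throws a security error when the canvas contains an image loaded from another origin without CORS approval.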

Shaun designed the software as a Greasemonkey script that breaks CAPTCHAs for Megaupload and automatically triggers a download. The code is written specifically for this CAPTCHA style, but there’s no reason why the getImageData trick combined with an alternate OCR implementation couldn’t be used to solve the CAPTCHAs of other systems. This is pretty fascinating stuff.
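
To make the idea of a different recognizer concrete, here is a minimal sketch of one alternative: nearest-template matching. This is a swapped-in technique, not what Shaun's script does. It assumes each character has already been cut out and reduced to a flat array of 0/1 pixel values; the names `hammingDistance` and `matchTemplate` and the template format are hypothetical.

```javascript
// Count the positions where two equal-length binary pixel arrays differ.
function hammingDistance(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) d++;
  }
  return d;
}

// Return the label of the stored template that differs from the glyph
// in the fewest pixels. `templates` maps a label to its pixel array.
function matchTemplate(glyph, templates) {
  let bestLabel = null;
  let bestDist = Infinity;
  for (const [label, pixels] of Object.entries(templates)) {
    const d = hammingDistance(glyph, pixels);
    if (d < bestDist) {
      bestDist = d;
      bestLabel = label;
    }
  }
  return bestLabel;
}
```

A recognizer like this tolerates a few flipped pixels but has no notion of rotation or distortion, which is part of why a trained network is attractive for anything beyond the simplest CAPTCHAs.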

Is there a better (more convenient, harder to cheat) way to prove humanness? What else could you make in JavaScript using OCR, neural nets, or per-pixel image processing?

Megaupload Auto-fill CAPTCHA
MuCaptcha Online Demo
OCR and Neural Nets in JavaScript – John Resig


Comments

  1. MakeFan says:

    Cool coding, but a neural network is simply a memorization table. Even changing the CAPTCHA letters to lowercase would break this code, and things get really messy once they introduce funny letters and random lines.

  2. Anonymous says:

    Neural networks have the ability to resist some noise, allowing them to see through any funny letters and random lines. You can also teach them lower case letters.

  3. Brian says:

    This probably requires two different checks, something like the question ‘Which of the following is a rabbit?’ followed by several images of various animals. The software would need to do two things: first, parse the English sentence as a question and understand what it means; then decode the images and determine which one matches.

  4. Stephen says:

    While MegaUpload uses a remarkably uncomplicated three letter CAPTCHA to validate genuine downloads I think this is extremely nice of them and calling it weak, while true, is somewhat the point.

    Who here has stared in incomprehension at the “pay to make this go away” cats and dogs CAPTCHA on Rapidshare? That’s the other end of the scale for “security”. Being able to punch in three letters and get your speedy, bulk download is wonderful in a world where the CAPTCHA has become a tool to generate revenue.

  5. selfSilent says:

    Captcha and other identical ideas are there to stop spambots from entering areas like forums. Why the hell would you want to help bypass this? It’s just gonna help the spammer.

  6. Shaun Friedle says:

    Stephen, Rapidshare has no CAPTCHA at all anymore; they have seen the light. Both Rapidshare and Megaupload restrict free users to one download at a time and limit the amount they can download in a day, and that is enough to prevent anyone from overusing their service anyway.

    selfSilent, I don’t know how a script that breaks the CAPTCHA on a download site will help anyone get into a forum. Spammers already have much more sophisticated software capable of breaking CAPTCHAs on forums; the only thing that’s new about my script is that it uses JavaScript.

  7. Anonymous says:

    I need a CAPTCHA decoder to reserve for some 50% deals on a website. It starts by making you click reserve, then goes to a page with a CAPTCHA code, then you have to click reserve again. They are usually limited to 100 items, and sell out in less than 10 seconds. If I could figure out how to get a decoder to get past that for me it would be so much easier. I need to read a bit more about these decoders.

  8. Can we get the source files?
