Sunday, November 8, 2009

Break a simple image CAPTCHA... it's not that hard

CAPTCHA always seemed like a kind of one-word challenge to me. It stands for Completely Automated Public Turing test to tell Computers and Humans Apart, and it's the industry standard for trying to keep my bot from scraping your web service or posting dirty comment spam. But some are very crazy... even too hard for a human to read... and how hard are they really to break? I have done a lot of reading about the theory of the CAPTCHA and tried to break one before (see how far I got below; I stopped at the anti-aliased rotation, but segmentation and cleaning the image were done, so I didn't have that much more to do).

partially broken captcha with rotated image and random lines
points to whoever can tell me where this captcha came from...



So for my second attempt I wanted to choose an easy one. The CAPTCHA I chose had these features:

  • image has changing static (easy to filter)
  • no letter rotation
  • fixed width font
  • pattern to solution (letter-number-letter-etc.)
Fixed width font is the worst offender on this list. It lets you eliminate the hardest part of breaking a test: it's called segmentation and it's hard. Once you get it down to one letter, though, OCR is normally accurate to something like 97%, and these bots are sending literally millions of requests, so a CAPTCHA is considered broken if it can be solved even half the time. That brings me to the second problem with this test... I can validate my answers. Because every phrase follows a letter-number-letter pattern, I can check whether my bot's answer at least fits the pattern without sending it. This is never good because it lets the attacker check their work, but let's get into how I broke this one.
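The pattern check described above can be sketched in a few lines. This is only an illustration, not the original program: the function name is made up, and it assumes solutions are uppercase letter and digit pairs, which is what the sample solution Y8L2Q5G7Q6O8 looks like.

```python
import re

# Assumption: a solution is one or more (uppercase letter, digit) pairs,
# like the sample "Y8L2Q5G7Q6O8". The real rule may differ.
PATTERN = re.compile(r"(?:[A-Z][0-9])+")


def looks_valid(guess: str) -> bool:
    """Return True when the whole guess alternates letter, digit, letter, digit..."""
    return PATTERN.fullmatch(guess) is not None


if __name__ == "__main__":
    for guess in ("Y8L2Q5G7Q6O8", "Y8LL2Q"):
        print(guess, looks_valid(guess))
```

A guess that fails this check is certainly wrong, so a bot can throw it away and request a new image instead of burning an attempt. A guess that passes is only plausible, not confirmed.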

The captcha i broke with 94% accuracy
(sample CAPTCHA with mouse-over effects)
The first thing that is meant to mess up a bot is the noise, and theirs wasn't so good. First, it didn't disrupt the letters that much, AND if I requested the image again I got back the same riddle with different static. This let me make a filter that extracts only the pixels that occur in both images, giving me the clean letters. I have read about neural nets and used examples but never implemented my own, so I decided to go with the easy way of guessing the letters. I created templates that represented the perfect symbol and then compared them byte by byte to determine which letter it was most like...
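A minimal sketch of that noise filter, assuming both downloads have already been thresholded into same-sized grids of 0 (background) and 1 (ink). The function name and the toy grids are made up for illustration and are not the original program's code.

```python
def common_pixels(first, second):
    """Keep a pixel only when it is set (1) in both binary images."""
    return [
        [a & b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]


if __name__ == "__main__":
    # Two hypothetical fetches of the same riddle: the letter pixels repeat,
    # the static lands in different places each time.
    fetch_one = [
        [1, 1, 0, 1],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
    ]
    fetch_two = [
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 0],
    ]
    for row in common_pixels(fetch_one, fetch_two):
        print("".join(str(cell) for cell in row))
```

Static that happens to hit the same spot in both fetches survives the filter, so requesting a third or fourth copy and intersecting again should thin it out further.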

Here's one of my templates for the number 5:

111111101
110000001
110000101
110111001
111001101
000000111
000000111
110000111
011001101
001111001
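The template comparison can be sketched like this: count the cells where the cleaned letter and a template agree, and pick the template with the highest share of matches. This is a hedged illustration, not the original code. The "5" grid is the template shown above; the "L" grid, the noisy sample, and the function names are all invented for the example.

```python
FIVE = [
    "111111101", "110000001", "110000101", "110111001", "111001101",
    "000000111", "000000111", "110000111", "011001101", "001111001",
]
# Hypothetical second template, only here so there is something to compare against.
MADE_UP_L = ["110000000"] * 8 + ["111111111"] * 2


def similarity(glyph, template):
    """Percentage of cells where the glyph and the template agree."""
    cells = [
        (g, t)
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    ]
    matches = sum(1 for g, t in cells if g == t)
    return 100.0 * matches / len(cells)


def best_guess(glyph, templates):
    """Return (symbol, score) for the template that matches the glyph best."""
    symbol = max(templates, key=lambda s: similarity(glyph, templates[s]))
    return symbol, similarity(glyph, templates[symbol])


if __name__ == "__main__":
    # The "5" template with its first cell flipped, standing in for leftover noise.
    noisy = ["011111101"] + FIVE[1:]
    symbol, score = best_guess(noisy, {"5": FIVE, "L": MADE_UP_L})
    print(f"I Guess its a {symbol} with a {score:.2f} %")
```

The score is just matching cells divided by total cells, so one stray pixel costs very little and a clean letter scores 100%.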

Here's what the program spits out while solving the test data...
I Guess its a Y with a 97.50 %
I Guess its a 8 with a 98.75 %
I Guess its a L with a 100.00 %
I Guess its a 2 with a 98.75 %
I Guess its a Q with a 100.00 %
I Guess its a 5 with a 100.00 %
I Guess its a G with a 98.75 %
I Guess its a 7 with a 97.50 %
I Guess its a Q with a 97.50 %
I Guess its a 6 with a 98.75 %
I Guess its a O with a 100.00 %
I Guess its a 8 with a 100.00 %
i was right! it was Y8L2Q5G7Q6O8
I Guess its a Y with a 97.50 %
I Guess its a 8 with a 98.75 %
I Guess its a L with a 100.00 %
I Guess its a 2 with a 98.75 %
I Guess its a Q with a 100.00 %
I Guess its a 5 with a 100.00 %
I Guess its a G with a 98.75 %
I Guess its a 7 with a 97.50 %
I Guess its a Q with a 97.50 %
I Guess its a 6 with a 98.75 %
I Guess its a O with a 100.00 %
I Guess its a 8 with a 100.00 %
i was right! it was Y8L2Q5G7Q6O8
(I cut out a bunch of them. Google was saying that my site had relevance for the word "guess", oops :)
and at the end

i checked a total of 265 files with a success rate of 94.34 %


I'm very happy with its accuracy (especially because I'm not using a neural net). Below are some more screenshots from the app I made to look at the data and test my solver...


You can see above that the program knows where the number 8 is because of the fixed width font. This made it easy to extract just the letter to be analyzed.

This is the training data input screen. The text box checks that the input is a valid solution and changes color when the pattern isn't followed, so bad training data isn't entered by mistake.
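With a fixed width font, segmentation comes down to slicing the image at regular column offsets. Here is a rough sketch, assuming the image is a list of equal-length strings of 0s and 1s. The function name, the `char_width` and `left_margin` parameters, and the toy image are hypothetical, not taken from the app in the screenshot.

```python
def split_fixed_width(image, char_width, left_margin=0):
    """Cut a binary image into glyphs that are each char_width columns wide."""
    width = len(image[0])
    glyphs = []
    # Step across the image one character cell at a time.
    for start in range(left_margin, width - char_width + 1, char_width):
        glyphs.append([row[start:start + char_width] for row in image])
    return glyphs


if __name__ == "__main__":
    # Toy two-row image holding three glyphs that are two columns wide each.
    image = ["110011", "100001"]
    for index, glyph in enumerate(split_fixed_width(image, char_width=2)):
        print(index, glyph)
```

Each glyph that comes out can be handed straight to the template comparison, which is why a fixed width font gives away so much.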


This was a very simple test to break, and it had very weak security features that let even a simple attack defeat it with great accuracy. If you are interested in other poor CAPTCHAs, you may like my post about text-based math CAPTCHAs and why they are so easy to bypass as well.

I plan to be doing some more work with CAPTCHA breaking soon (I'll probably step it up and take on one that needs a neural net). I wrote this code a few months ago and it's been just sitting there, so I thought I would share what I learned and how simple it can be to break one of these. I don't like CAPTCHAs because they are like locks on doors: they only keep the honest people out. Bandwidth is getting cheaper and cheaper. You should encourage people and companies to consume your services and learn to monetize the traffic, not implement stupid little pictures that a good bot can read anyway and that just waste the human's time as they try to figure out if it's a 0 or a capital O. I'd love to hear your thoughts on the subject and will respond if I can help, so please take the time to write a comment if you have any questions.
