reCAPTCHA: distributed book digitization while fighting spam

Thanks to spammers, we are now forced to waste a substantial portion of time every day typing in obfuscated wiggly letters to prove we are human. reCAPTCHA is a slick idea for using the CAPTCHA system to do something productive (…besides distinguishing between Homo sapiens and homo computatralis).

With reCAPTCHA, the user is given two words: one known by the system and one from a book that previously failed character recognition. When the user enters both words, the system verifies the known word, proving human-ness, and submits the second word to a central database, which helps digitize books from the Internet Archive. With 60 million CAPTCHAs being solved every day, this could be a huge assist for portions of text that can’t be handled by optical character recognition techniques.
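The two-word check described above can be sketched roughly as follows. This is only an illustration, not reCAPTCHA's actual code: the `votes` store, the function names, and the idea of accepting a transcription after three matching answers are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: unknown-word id -> tally of human transcriptions.
votes = defaultdict(Counter)


def check_captcha(known_answer, typed_known, unknown_id, typed_unknown):
    """Return True if the control word matches; record the guess for the unknown word."""
    if typed_known.strip().lower() != known_answer.lower():
        # Failed the control word, so the other guess is not trusted.
        return False
    votes[unknown_id][typed_unknown.strip().lower()] += 1
    return True


def consensus(unknown_id, min_votes=3):
    """Return the most common transcription once enough people agree, else None.

    The threshold of 3 is an assumption for this sketch.
    """
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

The design point is that the system never needs to know the second word in advance: it trusts a guess only when the same user has already proven themselves on the first word, and only once several such users agree.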

Related:
Negative CAPTCHA

0 thoughts on “reCAPTCHA: distributed book digitization while fighting spam”

tpe says:

June 4th, 2007 at 12:14 am

Client A then attempts to connect to a range of ports on client B’s machine. All these requests will fail at client B’s firewall, of course. However, in the process a side effect has occurred. Client A has told its own firewall to allow traffic from all of client B’s scanned ports! Now, when client B attempts to connect to client A, assuming its outgoing port was previously scanned (which it likely will be), the request will get through to client A’s machine.

I don’t think this is correct. Portscanning a remote system won’t “tell the firewall to allow traffic from […] the scanned ports”. This would be a really lame way to firewall. Instead, firewalls and NAT systems work on established sockets. Another socket connection that is utilizing the same port won’t magically get through just because that port has been used recently.

jason_striegel says:

June 4th, 2007 at 1:34 am

Think about how a connection is created in a typical scenario:

You send a packet from 1.2.3.4 port 1234 to 5.6.7.8 port 80. When this packet goes through your NAT router, it creates a lookup table entry that says 1.2.3.4 port 1234 is communicating with outside server 5.6.7.8 port 80.

5.6.7.8 then responds with a packet from 5.6.7.8 port 80 to 1.2.3.4 port 1234. When your NAT router sees this, it looks at the lookup table, checks to see if there was a mapping there, and determines whether to send the packet on (which it does in this case) or discard it.

So essentially, sending a packet to a machine on the other side of your NAT router causes the router to allow incoming packets from that machine, with the specific from and to ports that the original packet contained.
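As a toy model of that lookup table (not how any particular router implements it), the behaviour can be sketched like this. The `NatRouter` class and its method names are made up for the example, and the address rewriting a real NAT performs is left out so that only the allow-or-discard decision remains.

```python
class NatRouter:
    """Toy model of the lookup table described above; not a real NAT implementation."""

    def __init__(self):
        # Each entry: (inside_ip, inside_port, outside_ip, outside_port)
        self.table = set()

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """An inside host sends a packet out; remember the flow."""
        self.table.add((src_ip, src_port, dst_ip, dst_port))

    def inbound(self, src_ip, src_port, dst_ip, dst_port):
        """Return True if an incoming packet matches a remembered flow."""
        return (dst_ip, dst_port, src_ip, src_port) in self.table


router = NatRouter()
router.outbound("1.2.3.4", 1234, "5.6.7.8", 80)

# The reply on the same address and port pair is forwarded.
allowed = router.inbound("5.6.7.8", 80, "1.2.3.4", 1234)
# A packet from the same server but a different source port is discarded.
blocked = router.inbound("5.6.7.8", 81, "1.2.3.4", 1234)
```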

When two clients, both behind their own NAT router, need to talk to each other, they can coordinate a set of ports with a third-party public server. Then it’s a matter of punching holes through their own routers until packets start coming through and both sides see traffic.

FYI, this is made much easier with UDP (as opposed to TCP), as there are no sequence numbers to deal with.
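A minimal sketch of the UDP side of this, assuming both clients have already learned each other's public address and port from the third-party server. The `punch` function, its retry count, and its timeout are illustrative choices for this example, not part of any standard API.

```python
import socket


def punch(sock, peer_addr, attempts=5, timeout=0.5):
    """Send UDP packets to peer_addr until one arrives from it.

    Returns the received payload, or None if nothing got through.
    """
    sock.settimeout(timeout)
    for _ in range(attempts):
        # The outgoing packet is what creates the mapping in our own router.
        sock.sendto(b"punch", peer_addr)
        try:
            data, addr = sock.recvfrom(1024)
        except OSError:
            # Timed out (or the OS reported an unreachable port); try again.
            continue
        if addr == peer_addr:
            return data
    return None
```

Each client would run `punch` at about the same time with the other's address. The first packets may be dropped by the far router, but they open the mapping on the near one, so a later packet from the peer gets through.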