Defeat comment spam? Yes we can!

December 26, 2008, 12:26 pm · 8 comments · Filed under: Web Development

Thanks, but I’m not looking for a Russian bride and somehow I doubt yours is really a “trusted pharmacy.”

andrew.hedges.name






Comment spam is a scourge, as we all know.

I don’t claim to be an expert on these things, but I have stumbled on a set of techniques that has been 100% accurate for preventing comment spam on this blog. And, no, it’s not security through obscurity, smart ass.

I log every comment form submission, spam or not. Since I implemented my system a month ago, I have had 189 submissions. Of those, 11 were legitimate comments. The rest were spam.

All of the legitimate comments were published immediately. None of the spam comments were published. Not bad, eh?

As promised, the following slides explain in some detail the system I’ve implemented. I welcome your comments on these techniques below.

Defeating Comment Spam


8 comments

One thing my friend Michael did for a client that seemed to work was to put innocuously-named but hidden form fields on a form. I see you’re doing something similar and it’s working still. You could also mix up the field names; the one labeled name could actually be interpreted on both ends as the email field, for example. The automatic software is what you’re out to defeat; (sorry friend) your blog is not important enough to warrant paid spammers at this time.
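The hidden-field idea above can be sketched roughly as follows. This is a minimal illustration in Python, not the code this blog or Michael's client actually runs; the field names `website` and `phone` are made up for the example, and it assumes the decoy inputs are hidden from people with CSS.

```python
# Hypothetical decoy field names; the post does not say which names are used.
HONEYPOT_FIELDS = ("website", "phone")


def looks_automated(form: dict) -> bool:
    """Return True if any hidden decoy field came back non-empty.

    A person never sees the decoy inputs, so they stay empty.
    Form-filling software tends to populate every field it finds.
    """
    return any(form.get(name, "").strip() for name in HONEYPOT_FIELDS)


human = {"name": "Daniel", "comment": "Nice post", "website": "", "phone": ""}
bot = {"name": "x", "comment": "buy pills", "website": "http://spam.example", "phone": ""}

print(looks_automated(human))  # False
print(looks_automated(bot))    # True
```

Whether to drop such a submission silently or hold it for moderation is a separate choice; the check itself only says the form was probably not filled in by a person.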

Another thing he did, which I’m not sure would work today, was to have hidden form fields with values in them that had the same names as form fields that did not. Browsers at the time posted just one form field of a given name, and it would be whatever the input fields were rather than the hidden fields. The automatic spam technology of the time wasn’t that smart; it would populate the input fields and then merge the hidden fields into that, overwriting the input fields. (This was an accidental discovery.) I think nowadays more of the spam software works by driving a regular browser.

I can promise you there is some reprieve because the automatic software works heuristically. If your comment submission target script is named the same as (for example) Drupal or Wordpress’s comment submission script, you’ll probably find spammers pumping data into it in the format Drupal or Wordpress expects, even if the form on the front-end doesn’t work that way. Look in your Apache logs for 404’s for pages with those names and I bet you’ll find the evidence. You can honeypot that shit. :)

An email anti-spam technique I have yet to see implemented in blog software is greylisting. The way it works in email is, the first time a unique IP tries to transmit mail to you, return a transient error for about an hour to that IP. If they come back and keep trying with the same message after that hour is up, let it through and whitelist the IP. If they don’t, they’re a spammer. I’m not sure what the analogue to that would be for comment spam, but it’s food for thought. Maybe having the comment script delay for 30 seconds or more before returning the first time, or having a click-through, followed by whitelisting the IP. Of course IPs are much more reliable when you’re talking about mail servers than when you’re talking about residential users connecting through a couple levels of NAT. Long-lived innocuous cookies, perhaps?
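One possible analogue for comments, sketched in Python under loud assumptions: the `Greylist` class, the 30-second window and the idea of keying on an IP or cookie value are all illustrative, and nothing here is an existing library. The first submission from an unknown client is asked to retry; a retry after the window is accepted and the client is whitelisted.

```python
import time
from typing import Optional

RETRY_AFTER_SECONDS = 30.0  # assumed window, borrowed from the "30 seconds or more" idea


class Greylist:
    """In-memory greylist keyed by some client identifier (IP, cookie value, ...)."""

    def __init__(self, retry_after: float = RETRY_AFTER_SECONDS) -> None:
        self.retry_after = retry_after
        self.first_seen = {}    # client key -> time of first attempt
        self.whitelist = set()  # client keys that have come back after the window

    def check(self, client_key: str, now: Optional[float] = None) -> str:
        """Return "accept" or "retry" for a submission from client_key."""
        now = time.time() if now is None else now
        if client_key in self.whitelist:
            return "accept"
        first = self.first_seen.setdefault(client_key, now)
        if now - first >= self.retry_after:
            self.whitelist.add(client_key)
            return "accept"
        return "retry"


greylist = Greylist()
print(greylist.check("203.0.113.7", now=0.0))   # retry (first attempt)
print(greylist.check("203.0.113.7", now=10.0))  # retry (too soon)
print(greylist.check("203.0.113.7", now=45.0))  # accept (came back after the window)
```

A real version would need persistent storage and some way to carry the comment text across the retry, which this sketch leaves out.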

I heard from somewhere a while back that spammers were in cahoots with pornographers, and there were sites that would “verify your age” and ask you to do a captcha—a captcha foisted from, say, a Yahoo! mail account creation screen. The horny teenager became the unwitting spammer accomplice by trying to access a parasitic porn site that was really a mashup between a regular porn site and an email account generator. I wouldn’t be surprised if this practice were ongoing. It certainly makes me glad I’m not a big enough deal to be a target for one of these scums, having to participate in the arms race.

Very interesting stuff! By the way, where did you get the funky triskelion you’re using as an @ symbol? It seems to be on all your stuff lately.

Posted by: Daniel Lyons · December 29, 2008, 11:54 pm

Thanks for your comment, Daniel. Thoughtful as always! I like the idea of inter-referring hidden fields. Greylisting is an interesting idea, though I’m not sure how to implement it. I’m not too inclined to go the cookie route as I don’t consider them reliable.

The “triskelion” is actually a 45 record spindle, also known as a generational differentiator. ;-)

Posted by: Andrew Hedges · December 30, 2008, 8:23 am

Very good post, Andrew. I have issues with things like Captchas, especially ones with lowercase letters and numbers. Is that a one or an L?

It should be easy for a real user to post a comment. So far I’ve only had one issue with Akismet on my blogs and that was on your comment.

I’ll likely end up having to write a comment system at some point, so I’ve bookmarked this post for when that day inevitably comes. :)

Posted by: Josh Kendall · January 3, 2009, 8:51 am

For the sake of full disclosure, I just had to delete my first false negative, a spam message that leaked through the system last night.

So, that makes 45 days and 256 submissions (14 real, 242 spam) with just 1 failure. Not bad if that continues to be the failure rate.

It remains to be seen whether they’ve figured it out and the arms race is on, or if the spammers just got lucky!

Posted by: Andrew Hedges · January 8, 2009, 8:43 am

Hmm. I don’t mind comment moderation.

The authors of most sites on which I have commented have moderated quickly, so the delay doesn’t bother me.

On the other hand, pages of non-spam comments that were full of invective and name calling would bother me more.

Comment moderation also makes me bookmark sites to see what transpires.

And the inconvenience of having to complete a captcha or wait for comment moderation is an effective way of reinforcing the message that spammers are a pain in the behind.

Posted by: David · March 24, 2009, 1:13 am

Update: As of March 30, 2009, 7 spam comments have slipped through my defenses and 1 legitimate comment was moderated out of 1189 total submissions. That’s a 0.67% failure rate. Pretty good.
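For anyone checking the arithmetic, the failure rate counts both kinds of mistake (spam that got through plus legitimate comments held back) against total submissions. A small Python sketch, where the function name is just for illustration:

```python
def failure_rate(false_negatives: int, false_positives: int, total: int) -> float:
    """Share of submissions the filter got wrong in either direction."""
    return (false_negatives + false_positives) / total


# 7 spam comments published, 1 legitimate comment moderated, 1189 submissions.
rate = failure_rate(7, 1, 1189)
print(f"{rate:.2%}")  # 0.67%
```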

One recent development is that my blog appears to have been targeted by real human spammers at least twice. This calls for another level of protection.

As of today, I’ve added stop words to comment submissions. Now, if certain words are present in the comment submission, the comment will be flagged for moderation.

I’ve also added a limit to the number of URLs that can be present in the comment. More than 3 links will cause the comment to be flagged for moderation.
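Those two rules might look something like the Python sketch below. It is an illustration only: the stop words are invented for the example (the real list is not published here), and the URL pattern is a deliberately simple assumption about what counts as a link.

```python
import re

STOP_WORDS = {"pharmacy", "casino"}  # hypothetical examples, not the real list
MAX_LINKS = 3                        # more than 3 links flags the comment
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def needs_moderation(comment: str) -> bool:
    """Flag a comment if it contains a stop word or more than MAX_LINKS URLs."""
    words = set(re.findall(r"[a-z0-9']+", comment.lower()))
    if words & STOP_WORDS:
        return True
    return len(URL_PATTERN.findall(comment)) > MAX_LINKS


print(needs_moderation("Great post, thanks!"))             # False
print(needs_moderation("Visit my trusted pharmacy today"))  # True
print(needs_moderation(
    "http://a.example http://b.example http://c.example http://d.example"
))  # True
```

Flagged comments go to moderation rather than being discarded, so a false positive costs a delay instead of a lost comment.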

I’ll report back in a couple of months on whether this change further reduces the failure rate.

Posted by: Andrew Hedges · March 30, 2009, 10:06 am

After receiving a couple of what looked like human-crafted spam messages, I have implemented a couple more security measures.

I regularly add additional stop words whenever spam gets through, so at least the spammers have to find a new vector for attack. Additionally, I’ve now added the requirement that a particular cookie be set (which expires after 24 hours, to mimic the time limit on submissions themselves) and I am now looking for a few known-spammer user agent strings.
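A rough Python sketch of the cookie and user agent checks, with assumptions stated up front: it supposes the cookie carries the time it was issued (in practice that value would need to be signed so it can't be forged), and the user agent fragments are placeholders rather than the strings actually being matched.

```python
import time
from typing import Optional

COOKIE_MAX_AGE = 24 * 60 * 60  # 24 hours, matching the submission time limit
# Placeholder fragments; the real known-spammer strings are not listed in the post.
BLOCKED_AGENT_FRAGMENTS = ("examplespambot", "badcrawler")


def passes_checks(cookie_issued_at: Optional[float],
                  user_agent: str,
                  now: Optional[float] = None) -> bool:
    """Require a fresh cookie and a user agent that is not on the blocklist."""
    now = time.time() if now is None else now
    if cookie_issued_at is None:
        return False  # cookie missing: the form page was probably never loaded
    if now - cookie_issued_at > COOKIE_MAX_AGE:
        return False  # cookie older than 24 hours
    agent = user_agent.lower()
    return not any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)


print(passes_checks(1000.0, "Mozilla/5.0", now=2000.0))         # True
print(passes_checks(None, "Mozilla/5.0", now=2000.0))           # False
print(passes_checks(1000.0, "ExampleSpamBot/1.0", now=2000.0))  # False
```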

That oughta slow ‘em down for a while.

Defeating spam has turned out to be a cat-and-mouse game, as I suspected it would. Overall, I’m very happy with how my little scheme has performed over time. Here are the latest stats:

That’s a failure rate of 0.55%, skewed heavily towards allowing messages through that might be legitimate. If you’d told me 6 months ago that I’d achieve that level of success with a homegrown system, I’d have been thrilled.

Lastly, I gave in to peer pressure and have removed the feature where comments close after a period of time. Jeff Atwood of Coding Horror fame said of the technique, “I officially dub that lame.” He’s right that it’s a “boil the ocean” “solution,” but it was only there to cut down on the number of spam submissions I received. Anyway, all of my posts are open for commenting now. Happy, Jeff?

Posted by: Andrew Hedges · May 15, 2009, 10:34 pm

Cool strategies Andrew,

At www.klixo.co.nz we evaluate the IP of the submitter after they have submitted the form. If the IP is unusual (for the website concerned) we display a message and a CAPTCHA. For example, if the client’s website is a NZ business, and they get a form posted from Nigeria, we will display a CAPTCHA (after the form has been posted).

This allows most of the website’s legitimate customers to never see a CAPTCHA.

Other countermeasures include silently blacklisting any IP that submits without a referrer (a common signature of an automated submission) and silently blacklisting any IP that fails (client-side) validation.

In our system, a blacklisted IP just means that they will always see a CAPTCHA, allowing users at falsely blacklisted IPs to still submit legitimate forms, if they can solve the CAPTCHA.
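The decision described so far could be sketched like this in Python. It is a guess at the shape of the logic, not Klixo's implementation: the function and parameter names are invented, and the country for each IP is assumed to come from some GeoIP lookup done elsewhere.

```python
def needs_captcha(ip: str,
                  country: str,
                  has_referrer: bool,
                  passed_client_validation: bool,
                  expected_countries: set,
                  blacklist: set) -> bool:
    """Decide, after the form is posted, whether to show a CAPTCHA.

    Missing referrers and failed client-side validation silently add the
    IP to the blacklist. A blacklisted IP is never rejected outright; it
    simply always gets a CAPTCHA.
    """
    if not has_referrer or not passed_client_validation:
        blacklist.add(ip)
    return ip in blacklist or country not in expected_countries


blacklist = set()
expected = {"NZ"}  # e.g. a New Zealand business

print(needs_captcha("198.51.100.4", "NZ", True, True, expected, blacklist))   # False
print(needs_captcha("203.0.113.9", "NG", True, True, expected, blacklist))    # True
print(needs_captcha("198.51.100.4", "NZ", False, True, expected, blacklist))  # True
print(needs_captcha("198.51.100.4", "NZ", True, True, expected, blacklist))   # True
```

The last call shows the sticky part: once an IP has been blacklisted, later well-formed submissions from it still get the CAPTCHA.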

Because our Form-to-mail system is centralised (SaaS) and used by all of our customers’ websites, we have built up a good database of IP reputation through the methods described above and the dreaded “report as spam” button in our CMS.

Incidentally, if a user of our CMS reports a message as Spam, the Spam report is moderated by our helpdesk before being taken as read. You would not believe how many spam reports are not actually Spam! (I wish Yahoo!Xtra would do this, but that’s another story).

We get thousands of forms submitted each day. Before the human intervention, the failure rate is very, very low (you would have to ask the helpdesk @Klixo how low), but after the small amount of human intervention described above it is exactly 0%.

I hope any of this is useful in your fight against Spam.

Regards, Daniel Larsen, Director, Klixo Ltd, Whakatane.

Posted by: Daniel · July 8, 2009, 6:56 pm

Comments close automatically after 90 days.