andrew.hedges.name / blog
Defeat comment spam? Yes we can!
December 26, 2008, 12:26 pm · 8 comments · Filed under: Web Development
Thanks, but I’m not looking for a Russian bride and somehow I doubt yours is really a “trusted pharmacy.”
Comment spam is a scourge, as we all know.
I don’t claim to be an expert on these things, but I have stumbled on a set of techniques that have been 100% accurate for preventing comment spam on this blog. And, no, it’s not security through obscurity, smart ass.
I log every comment form submission, spam or not. Since I implemented my system a month ago, I have had 189 submissions. Of those, 11 were legitimate comments. The rest were spam.
All of the legitimate comments were published immediately. None of the spam comments were published. Not bad, eh?
As promised, the following slides explain in some detail the system I’ve implemented. I welcome your comments on these techniques below.
8 comments
Thanks for your comment, Daniel. Thoughtful as always! I like the idea of inter-referring hidden fields. Greylisting is an interesting idea, though I’m not sure how to implement it. I’m not too inclined to go the cookie route as I don’t consider them reliable.
The “triskelion” is actually a 45 record spindle, also known as a generational differentiator. ;-)
Very good post, Andrew. I have issues with things like Captchas, especially ones with lowercase letters and numbers. Is that a one or an L?
It should be easy for a real user to post a comment. So far I’ve only had one issue with Akismet on my blogs, and that was on your comment.
I’ll likely end up having to write a comment system at some point, so I’ve bookmarked this post for when that day inevitably comes. :)
For the sake of full disclosure, I just had to delete my first false negative, a spam message that leaked through the system last night.
So, that makes 45 days and 256 submissions (14 real, 242 spam) with just 1 failure. Not bad if that continues to be the failure rate.
It remains to be seen whether they’ve figured it out and the arms race is on, or if the spammers just got lucky!
Hmm. I don’t mind comment moderation.
The authors of most sites on which I have commented have moderated quickly, so the delay doesn’t bother me.
On the other hand, pages of non-spam comments that were full of invective and name calling would bother me more.
Comment moderation also makes me bookmark sites to see what transpires.
And being inconvenienced by having to complete a captcha or wait for comment moderation is an effective way of reinforcing the message that spammers are a pain in the behind.
Update: As of March 30, 2009, 7 spam comments have slipped through my defenses and 1 legitimate comment was moderated out of 1189 total submissions. That’s a 0.67% failure rate. Pretty good.
One recent development is that my blog appears to have been targeted by real human spammers at least twice. This calls for another level of protection.
As of today, I’ve added stop words to comment submissions. Now, if certain words are present in the comment submission, the comment will be flagged for moderation.
I’ve also added a limit to the number of URLs that can be present in the comment. More than 3 links will cause the comment to be flagged for moderation.
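A minimal Python sketch of these two checks might look like the following. The stop-word list, the function name, and the regular expressions are assumptions made for illustration; the post does not say which words are used or how links are counted. Only the limit of 3 links comes from the text above.

```python
import re

# Hypothetical stop words; the real list is not given in the post.
STOP_WORDS = {"viagra", "casino", "pharmacy"}
MAX_LINKS = 3  # more than 3 links flags the comment, per the post

URL_PATTERN = re.compile(r"https?://", re.IGNORECASE)
WORD_PATTERN = re.compile(r"[a-z']+")


def needs_moderation(comment: str) -> bool:
    """Return True if the comment should be held for moderation."""
    # Compare whole lowercase words against the stop-word list.
    words = set(WORD_PATTERN.findall(comment.lower()))
    if words & STOP_WORDS:
        return True
    # Count link prefixes as a rough proxy for the number of URLs.
    return len(URL_PATTERN.findall(comment)) > MAX_LINKS
```

Flagging for moderation rather than rejecting outright keeps a legitimate comment that happens to trip a rule recoverable by a human.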
I’ll report back in a couple of months on whether this change further reduces the failure rate.
After receiving a couple of what looked like human-crafted spam messages, I have implemented a couple more security measures.
I regularly add stop words whenever spam gets through, so at least the spammers have to find a new vector for attack. Additionally, I’ve now added the requirement that a particular cookie be set (it expires after 24 hours, to mimic the time limit on submissions themselves), and I am now looking for a few known-spammer user agent strings.
That oughta slow ‘em down for a while.
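As a rough sketch of the cookie and user agent checks just described, assuming the cookie carries the Unix timestamp at which it was issued: the function name, the cookie format, and the blocked user agent fragments below are all hypothetical, since the post does not specify them.

```python
import time

COOKIE_MAX_AGE = 24 * 60 * 60  # 24 hours, per the post

# Hypothetical substrings; the post does not name the user agents it checks.
BLOCKED_AGENT_FRAGMENTS = ("examplespambot", "badcrawler")


def passes_checks(cookie_issued_at, user_agent, now=None):
    """Return True if the submission has a fresh cookie and an acceptable agent.

    cookie_issued_at is the Unix timestamp read from the cookie, or None
    if the cookie was not sent at all.
    """
    now = time.time() if now is None else now
    if cookie_issued_at is None:
        return False
    age = now - cookie_issued_at
    # Reject timestamps from the future as well as expired ones.
    if age < 0 or age > COOKIE_MAX_AGE:
        return False
    agent = (user_agent or "").lower()
    return not any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)
```

In practice the timestamp would probably need to be signed or stored server-side, since a client can set any cookie value it likes.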
Defeating spam has turned out to be a cat-and-mouse game, as I suspected it would. Overall, I’m very happy with how my little scheme has performed over time. Here are the latest stats:
- 3,431 total comment submissions
- 57 legitimate comments
- 3,374 spam submissions
- 18 false negatives (spam messages that leaked through)
- 1 false positive (legitimate comment that was flagged as spam)
That’s a failure rate of 0.55% skewed heavily towards allowing messages through that might be legitimate. If you’d told me 6 months ago that I’d achieve that level of success with a homegrown system, I’d have been thrilled.
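For anyone who wants to check the arithmetic, the rate is the number of mishandled submissions (18 spam messages that leaked through plus 1 legitimate comment that was flagged) divided by the total. The function name below is just for illustration; the second call uses the March 30, 2009 figures quoted earlier in the thread.

```python
def failure_rate(leaked_spam, blocked_legit, total):
    """Percentage of submissions that were handled incorrectly."""
    return 100.0 * (leaked_spam + blocked_legit) / total


print(f"{failure_rate(18, 1, 3431):.2f}%")  # prints 0.55%
print(f"{failure_rate(7, 1, 1189):.2f}%")   # prints 0.67%
```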
Lastly, I gave in to peer pressure and have removed the feature where comments close after a period of time. Jeff Atwood of Coding Horror fame said of the technique, “I officially dub that lame.” He’s right that it’s a “boil the ocean” “solution,” but it was only there to cut down on the number of spam submissions I received. Anyway, all of my posts are open for commenting now. Happy, Jeff?
Cool strategies Andrew,
At www.klixo.co.nz we evaluate the IP of the submitter after they have submitted the form. If the IP is unusual (for the website concerned) we display a message and a CAPTCHA. For example, if the client’s website is a NZ business, and they get a form posted from Nigeria, we will display a CAPTCHA (after the form has been posted).
This allows most of the website’s legitimate customers to never see a CAPTCHA.
Other countermeasures include silently blacklisting any IP that submits without a referrer (a common signature of an automated submission) and silently blacklisting any IP that fails (client-side) validation.
In our system, a blacklisted IP just means that they will always see a CAPTCHA, allowing users at falsely blacklisted IPs to still submit legitimate forms, if they can solve the CAPTCHA.
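The decision logic described in this comment could be sketched roughly as below. This is not Klixo's code: the class and method names are hypothetical, and the lookup from IP address to country is assumed to happen elsewhere (for example in a GeoIP database), so the caller passes in a country code.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FormGuard:
    # Countries from which submissions are considered usual for this site.
    expected_countries: Set[str]
    # IPs that will always be shown a CAPTCHA.
    blacklist: Set[str] = field(default_factory=set)

    def record_submission(self, ip: str, referrer: str, passed_validation: bool) -> None:
        """Silently blacklist IPs that look automated."""
        if not referrer or not passed_validation:
            self.blacklist.add(ip)

    def needs_captcha(self, ip: str, country: str) -> bool:
        """Show a CAPTCHA for blacklisted IPs and unusual countries."""
        return ip in self.blacklist or country not in self.expected_countries


guard = FormGuard(expected_countries={"NZ"})
guard.record_submission("203.0.113.5", referrer="", passed_validation=True)
print(guard.needs_captcha("203.0.113.5", "NZ"))   # True: no referrer was sent
print(guard.needs_captcha("198.51.100.7", "NZ"))  # False: usual country, clean IP
```

Because a blacklisted IP only triggers a CAPTCHA rather than a rejection, a wrongly listed visitor can still get a legitimate form through.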
Because our Form-to-mail system is centralised (SaaS) and used by all of our customers’ websites, we have built up a good database of IP reputation through the methods described above and the dreaded “report as spam” button in our CMS.
Incidentally, if a user of our CMS reports a message as Spam, the Spam report is moderated by our helpdesk before being taken as read. You would not believe how many spam reports are not actually Spam! (I wish Yahoo!Xtra would do this, but that’s another story).
We get thousands of forms submitted each day. Before the human intervention, the failure rate is very, very low (you would have to ask the helpdesk @Klixo how low), but after the small amount of human intervention described above it is exactly 0%.
I hope any of this is useful in your fight against Spam.
Regards, Daniel Larsen, Director, Klixo Ltd, Whakatane.
One thing my friend Michael did for a client that seemed to work was to put innocuously-named but hidden form fields on a form. I see you’re doing something similar and it’s working still. You could also mix up the field names; the one labeled “name” could actually be interpreted on both ends as the “email” field, for example. The automatic software is what you’re out to defeat; (sorry friend) your blog is not important enough to warrant paid spammers at this time.
Another thing he did, which I’m not sure would work today, was to have hidden form fields with values in them that had the same names as form fields that did not. Browsers at the time posted just one form field of a given name, and it would be whatever the input fields were rather than the hidden fields. The automatic spam technology of the time wasn’t that smart; it would populate the input fields and then merge the hidden fields into that, overwriting the input fields. (This was an accidental discovery.) I think nowadays more of the spam software works by driving a regular browser.
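A small Python sketch of the first two ideas, hidden decoy fields and swapped field names, might look like this on the server side. The field names, the mapping, and the function names are all made up for illustration; nothing here is taken from Michael's or Andrew's actual forms.

```python
# Hypothetical decoy fields: hidden with CSS, so a human leaves them empty.
HONEYPOT_FIELDS = ("website", "phone")

# Hypothetical swap: the input named "name" really holds the email address,
# and the input named "email" really holds the commenter's name.
FIELD_MEANING = {"name": "email", "email": "name"}


def looks_automated(form: dict) -> bool:
    """Return True if any hidden decoy field was filled in."""
    return any(form.get(field, "").strip() for field in HONEYPOT_FIELDS)


def decode_fields(form: dict) -> dict:
    """Map posted field names back to their real meaning, dropping decoys."""
    return {
        FIELD_MEANING.get(key, key): value
        for key, value in form.items()
        if key not in HONEYPOT_FIELDS
    }
```

A bot that fills every field it finds trips `looks_automated`, and one that guesses field meanings from their names ends up putting an invalid value in the real email slot.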
I can promise you there is some reprieve because the automatic software works heuristically. If your comment submission target script is named the same as (for example) Drupal or Wordpress’s comment submission script, you’ll probably find spammers pumping data into it in the format Drupal or Wordpress expects, even if the form on the front-end doesn’t work that way. Look in your Apache logs for 404’s for pages with those names and I bet you’ll find the evidence. You can honeypot that shit. :)
An email anti-spam technique I have yet to see implemented in blog software is greylisting. The way it works in email is, the first time a unique IP tries to transmit mail to you, return a transient error for about an hour to that IP. If they come back and keep trying with the same message after that hour is up, let it through and whitelist the IP. If they don’t, they’re a spammer. I’m not sure what the analogue to that would be for comment spam, but it’s food for thought. Maybe having the comment script delay for 30 seconds or more before returning the first time, or having a click-through, followed by whitelisting the IP. Of course IPs are much more reliable when you’re talking about mail servers than when you’re talking about residential users connecting through a couple levels of NAT. Long-lived innocuous cookies, perhaps?
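One possible way to carry greylisting over to comments is sketched below, as a direct port of the email version keyed by IP address. The class name, the in-memory storage, and the default delay are assumptions for illustration, and the caveat above about residential IPs behind NAT applies to it in full.

```python
import time


class Greylist:
    """Turn away the first attempt from an IP; accept a retry after a delay."""

    def __init__(self, delay_seconds=3600):
        self.delay = delay_seconds
        self.first_seen = {}    # ip -> time of first attempt
        self.whitelist = set()  # ips that retried after the delay

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        if ip in self.whitelist:
            return True
        # Remember when this IP was first seen; keep the original time on retries.
        first = self.first_seen.setdefault(ip, now)
        if now - first >= self.delay:
            self.whitelist.add(ip)
            return True
        return False  # caller should return a "please try again later" response


greylist = Greylist(delay_seconds=60)
print(greylist.allow("203.0.113.5", now=0))   # False: first attempt
print(greylist.allow("203.0.113.5", now=30))  # False: retried too soon
print(greylist.allow("203.0.113.5", now=60))  # True: retried after the delay
```

A real deployment would also need persistent storage and some expiry for old entries, neither of which is shown here.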
I heard from somewhere a while back that spammers were in cahoots with pornographers, and there were sites that would “verify your age” and ask you to do a captcha—a captcha foisted from, say, a Yahoo! mail account creation screen. The horny teenager became the unwitting spammer accomplice by trying to access a parasitic porn site that was really a mashup between a regular porn site and an email account generator. I wouldn’t be surprised if this practice were ongoing. It certainly makes me glad I’m not a big enough deal to be a target for one of these scums, having to participate in the arms race.
Very interesting stuff! By the way, where did you get the funky triskelion you’re using as an @ symbol? It seems to be on all your stuff lately.
★ Posted by: Daniel Lyons · December 29, 2008, 11:54 pm