Before we get to methods for fighting comment spam, it is important to understand how spammers and spambots work, what methods they use, and what types of comments they post.
This analysis is based on the comment spam on www.dev4press.com. The Dev4Press website is a network of over 20 websites, but comments are allowed only on the main website, which includes the blog. Before we go on, you might want to check out the previous article: Comment spam: how does it work?
The WordPress comment settings are configured so that anyone can comment (with or without a user account), comment authors must have a previously approved comment (so basically, all comments from new authors are moderated), and comments are held for moderation if they contain 3 or more links. There are no keywords in the moderation list or the blacklist.
Analysis period and basic stats
This analysis deals with the comment spam on Dev4Press for the whole of 2015. In this period, the Dev4Press blog received a total of 628,925 spam comments. No spam appeared on the website; it was all caught by WordPress (comments with too many links) or by the Antispam Bee plugin. The analysis focuses on the structure of the comments (length of the comment, number of links) and on the user accounts posting spam.
For this period, there were a total of 2,153,579 links in all the spam comments combined. That is about 3.4 links per spam comment. 538,913 spam comments were posted by registered user accounts. Spammers used 213 different registered accounts in total, with about 60 accounts in use in an average month. These accounts used emails belonging to 54 different domains (many of these are no longer active).
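The totals above are easy to re-check. Here is a minimal Python sketch of that arithmetic; the variable and function names are my own, chosen for illustration, and only the three input numbers come from this analysis.

```python
# Totals reported in this analysis for 2015.
total_spam = 628_925          # all spam comments received
spam_from_accounts = 538_913  # spam comments posted by registered accounts
accounts_total = 213          # distinct registered accounts used by spammers


def share_percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Share of spam delivered through registered accounts.
registered_share = share_percent(spam_from_accounts, total_spam)

# Spam comments posted without a registered account.
anonymous_spam = total_spam - spam_from_accounts

# Average number of spam comments per spamming account over the year.
avg_per_account = round(spam_from_accounts / accounts_total)

print(f"{registered_share}% of spam came from registered accounts")
print(f"{anonymous_spam} spam comments came from visitors without an account")
print(f"about {avg_per_account} spam comments per account")
```

The per-account average is only a rough indicator, since accounts were not all active for the same number of months.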
Spam and links
Let’s start with the number of links in each spam comment. This is very important because the number of links is the easiest way to spot a spam comment. The data also shows the number of comments that had links wrapped in BBCode URL tags. This number was on the rise in the last 3 months of 2015, but I am not sure why.
And here is the chart.
In the last 2 months of 2015, the number of links was rising, and the average was over 5 links per comment. The overall average for the whole year is about 3.4. I am not sure of the reason for this, but the most likely explanation is that the last two months include the biggest holidays of the year, and that pushed spammers to work more.
Some comments had more than 200 links, but most comments had only 2 links, and some had no links! Here is the chart:
The interesting thing about the comments with no links is that they most likely represent bugs in the spammers' software: the comment text looked like something cut from a larger text, and it was obvious that either the beginning or the end was missing. And the fact that most comments had only 2 links points to comments being tailored to pass undetected through the default WordPress filters.
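Counting links in a comment, including the ones wrapped in BBCode URL tags, can be sketched in a few lines of Python. This is an illustration only, not the code WordPress or Antispam Bee uses: the patterns, the function names and the `max_links` parameter are assumptions, and bare URLs outside of tags are deliberately ignored to avoid counting the same link twice.

```python
import re

# Assumed patterns, for illustration only.
# HTML anchor with an href attribute, e.g. <a href="...">
ANCHOR_RE = re.compile(r"<a\s[^>]*href", re.IGNORECASE)
# Opening BBCode URL tag, e.g. [url] or [url=...]; closing [/url] is not matched.
BBCODE_URL_RE = re.compile(r"\[url(?:=[^\]]*)?\]", re.IGNORECASE)


def count_links(comment_text: str) -> dict:
    """Count HTML anchors and BBCode URL tags in the comment text."""
    html_links = len(ANCHOR_RE.findall(comment_text))
    bbcode_links = len(BBCODE_URL_RE.findall(comment_text))
    return {
        "html": html_links,
        "bbcode": bbcode_links,
        "total": html_links + bbcode_links,
    }


def should_hold(comment_text: str, max_links: int = 3) -> bool:
    """Hold the comment for moderation if it has `max_links` or more links."""
    return count_links(comment_text)["total"] >= max_links


sample = (
    '<a href="http://a.example">a</a> '
    "[url=http://b.example]b[/url] "
    "[URL]http://c.example[/URL]"
)
print(count_links(sample))
print(should_hold(sample))
```

The default of 3 for `max_links` mirrors the setting used on Dev4Press (hold comments with 3 or more links). A comment with exactly 2 links would pass this particular check.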
Spam comments size
The size of a comment is directly linked to the number of links in it, so it is understandable that about 71% of all comments had under 1KB of comment content. The remaining 29% of comments were over 1KB in size. In the most extreme cases, there were comments with more than 20KB of content. And, yes, there were comments with content made of one word, or just a few words. Such comments rely on the single link in the URL field, with no links in the content. Approximately 2% of all spam comments were under 20 characters in length.
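If you want to run the same kind of size breakdown on your own spam queue, here is a minimal Python sketch. The bucket names and the choice of 1KB = 1024 bytes of UTF-8 text are my assumptions for this example; the analysis above does not depend on them.

```python
from collections import Counter

KB = 1024  # assumption: 1KB means 1024 bytes of UTF-8 encoded content


def size_bucket(content: str) -> str:
    """Put a comment into one of four size buckets."""
    if len(content) < 20:
        return "under_20_chars"
    size_bytes = len(content.encode("utf-8"))
    if size_bytes < KB:
        return "under_1kb"
    if size_bytes <= 20 * KB:
        return "1kb_to_20kb"
    return "over_20kb"


def bucket_shares(comments) -> dict:
    """Return the percentage of comments in each bucket that occurs."""
    counts = Counter(size_bucket(c) for c in comments)
    total = sum(counts.values())
    return {bucket: round(100 * n / total, 1) for bucket, n in counts.items()}


# Made-up comments, only to show the shape of the result.
demo = ["nice post", "x" * 500, "x" * 600, "x" * 2000]
print(bucket_shares(demo))
```

Note that in this sketch the very short comments get their own bucket, so they are not counted again in the under 1KB bucket.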
The most important metric in this analysis shows how the spam got delivered. I was surprised to learn that over 85% of all spam comments were posted using registered user accounts. The Dev4Press website allows free account registration, so spammers use that to create accounts for delivering spam. The logic is that spam filters might let registered users post comments unchecked. And this only proves that spam filters need to check all comments, regardless of the user account used.
And here is the chart:
The accounts used for spamming overlapped from month to month, but most accounts were active for 2-3 months. Most of the domains used for emails are no longer active, and most were related to ‘adult toys’ in all sorts of variations. But there were also regular Gmail, Yahoo, and other popular free mail services. A total of 7,233 IP addresses were used to deliver spam to Dev4Press during 2015.
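Counting distinct email domains and IP addresses like this is simple once the spam records are exported. Here is a minimal Python sketch, assuming each record is a hypothetical (account email, IP address) pair; the record format and the function names are made up for the example.

```python
from collections import Counter


def email_domain(email: str) -> str:
    """Return the lowercased part of the address after the last '@'."""
    return email.rsplit("@", 1)[-1].strip().lower()


def summarize_sources(records) -> dict:
    """Summarize (account_email, ip_address) pairs, one pair per spam comment."""
    domain_counts = Counter()
    ips = set()
    for email, ip in records:
        domain_counts[email_domain(email)] += 1
        ips.add(ip)
    return {
        "domains": len(domain_counts),                 # distinct email domains
        "ips": len(ips),                               # distinct IP addresses
        "top_domains": domain_counts.most_common(3),   # busiest domains
    }


# Made-up records using reserved example domains and IP addresses.
records = [
    ("a@Example.com", "192.0.2.1"),
    ("b@example.com", "192.0.2.2"),
    ("c@mail.test", "192.0.2.1"),
]
print(summarize_sources(records))
```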
This whole analysis was very useful to me. It helped me better understand how spam gets delivered, what types of accounts are used, what the comments look like, and what kind of content they include. Different websites will get different results, influenced by their comment settings and their policy on account registration. But I have compared data from other websites that allow free account registration, and they also show that most spam comments tend to be delivered by registered accounts.
The next posts in this series will deal with methods for spam prevention, starting with the use of honeypots, followed by reCAPTCHA and other spam-fighting methods.