If you’ve been digging through your analytics, you may have noticed some unusual data. If I go to my reporting page and scroll down, I’ll see a list of languages used on my site, and some of them are definitely spam.
In fact, over the last month I’ve had more than 1,000 spam visits. Those visits are enough to skew my data and taint any decision based on it. For example:
Wow this post is doing great – we should write more like it!
With just a bit of filtering we can remove this language spam, which will make our reports easier to navigate and give us much better data.
Before we start, make sure you create a backup view. This way, if there’s a mistake, you won’t lose any data. It takes just a couple of minutes to set up an unfiltered view, so make sure you do it.
Adding a Filter to a Google Analytics View
Now that we have our backup view we can add a filter to our main view.
When looking at any of our reports we can go to Admin
And then click on Filters
Then click the button + Add Filter
Then fill out the following information. Copy and paste this Filter Pattern exactly:
If you want to know more about how I created the Filter Pattern keep reading. Otherwise click the Save button and you’re done.
Writing a Regular Expression for Language Codes
You probably know what language codes look like. They’re typically two letters for a language followed by a region. But sometimes it’s just the language.
And sometimes there are very specific regions which have their own codes, like es-419 for “Spanish appropriate for the Latin America and Caribbean region.”
And sr-latn-rs for “Serbian with Latin characters” in a particular region.
We could write a filter that uses a regular expression to include all valid languages. Something like this:
I could use this with an inclusive filter (the view will only include data with the above format) but I’m a bit worried I’ll filter out a language code I’m not familiar with. So I’d rather write a filter to exclude obvious spam.
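To make the include approach concrete, here’s a minimal sketch in Python. The pattern and the function name are assumptions for illustration only, not necessarily the exact pattern used in the filter: it accepts a two or three letter language followed by up to two optional parts, each either two to four letters or three digits.

```python
import re

# Hypothetical include-style pattern (an assumption for illustration):
# a 2-3 letter language, then up to two optional "-part" pieces,
# where each part is 2-4 letters (us, latn) or 3 digits (419).
LANGUAGE_CODE = re.compile(
    r"^[a-z]{2,3}(-([a-z]{2,4}|[0-9]{3})){0,2}$",
    re.IGNORECASE,
)


def looks_like_language_code(value: str) -> bool:
    """Return True when the whole value has the shape of a language code."""
    return LANGUAGE_CODE.fullmatch(value) is not None


samples = [
    "en",
    "en-us",
    "es-419",
    "sr-latn-rs",
    "o-o-8-o-o.com search shell is much better than google",
]
for sample in samples:
    print(sample, looks_like_language_code(sample))
```

The four codes mentioned above pass and the spam string does not. The catch is the one already described: any legitimate code that doesn’t fit this shape would be silently dropped by an include filter.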
BTW if you’ve never used regular expressions they can look really complicated but can be broken down and visualized with tools like RegExr.
Language Spam Examples
So let’s try to filter out the spam. That’s easier to do and we’re less likely to make mistakes. Here are a few examples from my own analytics:
Vitaly rules google ☆:｡゜ﾟ･ヽ(^ᴗ^)ﾉ･゜ﾟ｡:☆ ¯\_(ツ)_/¯(ಠ益ಠ)(ಥ‿ಥ)(ʘ‿ʘ)ლ(ಠ_ಠლ)( ͡° ͜ʖ ͡°)ヽ(ﾟДﾟ)ﾉʕ•̫͡•ʔᶘ ᵒᴥᵒᶅ(=^ ^=)oO
Google officially recommends o-o-8-o-o.com search shell!
o-o-8-o-o.com search shell is much better than google
These contain special characters and spaces, so they should be easy to match. This regular expression will match any string with an exclamation point or space:
I’m actually okay with this for right now. The spammer may find their way around this and that’s okay because I can always tweak this regular expression.
I’d rather exclude most of the spam and accept a small risk of some getting through than filter out actual data, because if we accidentally filtered out actual data we’d have to fall back on our backup view (which has a TON of spam).
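If you want to try the exclude idea outside of Google Analytics first, here’s a small Python sketch. The pattern `[! ]` (a character class containing an exclamation point and a space) and the helper name are assumptions chosen to match the description above; they aren’t guaranteed to be identical to the pattern in the filter settings.

```python
import re

# Assumed exclude pattern: matches any value that contains
# an exclamation point or a space anywhere in the string.
SPAM_PATTERN = re.compile(r"[! ]")


def is_language_spam(value: str) -> bool:
    """Return True when the value contains '!' or a space."""
    return SPAM_PATTERN.search(value) is not None


values = [
    "en-us",
    "es-419",
    "sr-latn-rs",
    "Google officially recommends o-o-8-o-o.com search shell!",
    "o-o-8-o-o.com search shell is much better than google",
]

# Keep only the values the exclude pattern does not match.
kept = [value for value in values if not is_language_spam(value)]
print(kept)  # ['en-us', 'es-419', 'sr-latn-rs']
```

Real language codes never contain spaces or exclamation points, so they pass through untouched, while both spam strings are dropped.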
Verify Your Data
Make sure you verify the regular expression. It’s possible to get this error:
This filter would not have changed your data. Either the filter configuration is incorrect, or the set of sampled data is too small.
If this happens, it doesn’t mean the filter didn’t work. Verification only uses a subset of your data, so it may simply miss the spammy entries.
Or it could be that you have implemented other spam proofing techniques (like [excluding known bots](LINK)) in which case it is already filtered out. You may still see spam in your reports if you just implemented other solutions.
Language Not Set
When you verify your filter you might notice that you’re filtering out visits where the language is not set.
If you see this you don’t need to worry. Browsers that don’t set languages are approximately 0.02% of traffic.
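A likely reason this shows up is easy to check in Python. This sketch assumes the missing language appears in reports as the literal value `(not set)` and that the exclude pattern matches spaces, as in `[! ]`; both are assumptions for illustration.

```python
import re

# Assumed exclude pattern: an exclamation point or a space.
SPACE_OR_BANG = re.compile(r"[! ]")


def is_excluded(value: str) -> bool:
    """Return True when the exclude pattern matches the value."""
    return SPACE_OR_BANG.search(value) is not None


print(is_excluded("(not set)"))  # True: the value contains a space
print(is_excluded("en-us"))      # False
```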
Once you’re done with the language filter go ahead and save it.
And just for your own reference leave an annotation. That way you have a record of the exact day you implemented this.