Crawler spam is what happens when a bot crawls through your site and leaves fake data behind. It might leave a fake referral just to get you to check out the referring website.
In fact, I've had this exact thing happen. I was looking at my analytics and thought someone had mentioned one of my articles on Reddit, because I got several hundred visits over a few days. I wasted my time digging through analytics trying to find the exact post.
And in the end I came to this page:
Reddit has already cleaned this up. They reported this as spam and it's pretty obvious now. But if you got hundreds of hits from this page wouldn't you be curious? Wouldn't you click the link to see how this relates to your site and the surge in traffic?
The whole goal is to leave fake referral data so you click on one of the spammy links. Crawler spam like this dirties your data and will influence any decisions you make based on that data. We want to prevent any of this from actually being recorded so you don't click on spam links and so you can make smart decisions with your data.
Create a Filter
This process is going to be very similar to creating a filter for language spam. So I'll abbreviate the steps here:
- Log into Google Analytics
- Pull up a report for one of your views
- Click on Admin
- Click on Filters
- Click + Add Filter
- Fill in the following fields (see below for the Filter Pattern)
- Click Save
Now the Filter Pattern is so long that we actually have to break this into two filters. So copy the first one in. And then repeat the steps for a second filter.
I have to give kudos to Carlos Escalera for compiling this list.
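If you ever need to rebuild or extend the patterns yourself, the splitting step can be sketched in a few lines of Python. This is only a rough sketch: the domains are made-up placeholders (not the real list), `build_filter_patterns` is a hypothetical helper, and the 255-character limit per Filter Pattern is an assumption you should confirm in your own account.

```python
MAX_PATTERN_LENGTH = 255  # assumed per-filter character limit; confirm in your account


def build_filter_patterns(domains, max_length=MAX_PATTERN_LENGTH):
    """Group domains into "a|b|c" patterns, each at most max_length characters."""
    patterns = []
    current = ""
    for domain in domains:
        escaped = domain.replace(".", r"\.")  # match a literal dot, not "any character"
        if len(escaped) > max_length:
            raise ValueError(f"{domain!r} cannot fit in a single pattern")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) > max_length:
            # The current pattern is full, so start a new one.
            patterns.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns


# Placeholder domains for illustration only, not real spam sources.
placeholder_domains = [f"spam-source-{i}.example" for i in range(30)]
patterns = build_filter_patterns(placeholder_domains)
for number, pattern in enumerate(patterns, start=1):
    print(f"Filter {number}: {len(pattern)} characters")
```

Each pattern it prints would go into its own filter, which is the same reason the list above is split in two.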
The Challenge With Crawler Spam
Crawler spam is really hard to detect. A crawler can look identical to a browser requesting information, and it can send identical data to Google Analytics.
The only way to filter out this data is to use a list of known spam referrers, which is what we're doing above.
The downside of using known spam websites is that spammers can keep making new ones and your filter won't catch them. It can feel a bit like whack-a-mole.
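To make the whack-a-mole problem concrete, here is a minimal Python sketch of how a known-referrer pattern behaves. The pattern and hostnames are placeholders I made up for illustration, and Python's regex engine stands in for whatever Google Analytics uses internally, so treat it as an approximation.

```python
import re

# Placeholder pattern; a real one would list known spam referrers.
SPAM_PATTERN = re.compile(r"known-spam\.example|other-spam\.example")


def is_known_spam(referrer_hostname):
    """Return True when the hostname matches a listed spam referrer."""
    return SPAM_PATTERN.search(referrer_hostname) is not None


print(is_known_spam("known-spam.example"))      # listed, so it is filtered
print(is_known_spam("brand-new-spam.example"))  # not listed yet, so it slips through
```

The second call returns False, which is exactly the gap a spammer uses by registering a new domain.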
The good news is that while crawler spam is hard to prevent, it's less common than you might think. It requires a lot more resources than ghost spam, where a program sends information directly to Google Analytics without actually crawling your website.
Don't worry about preventing 100% of crawler spam. It's impossible. But by filtering out the most common known sources, you're going to drastically reduce it.
Verify Your Filter
It's always a good idea to verify your filter. Make sure you don't have a typo that will eliminate legitimate data.
You can do this before you press the save button. You should either see a list of spam being filtered out or, since the verify button uses a small subset of data, you might see the following error message:
This filter would not have changed your data. Either the filter configuration is incorrect, or the set of sampled data is too small.
As long as you don't see legitimate data you're good to go.
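If you want a second check outside Google Analytics, you can run the pattern against referrers you know are legitimate. The sketch below assumes Python's regex engine behaves closely enough to the one in Google Analytics for this purpose. The patterns, hostnames, and the `wrongly_excluded` helper are all hypothetical.

```python
import re


def wrongly_excluded(pattern, legitimate_referrers):
    """Return the legitimate referrers that the pattern would filter out."""
    compiled = re.compile(pattern)
    return [host for host in legitimate_referrers if compiled.search(host)]


# Hypothetical referrers you expect to keep.
legitimate = ["reddit.com", "google.com", "news.ycombinator.com"]

good_pattern = r"known-spam\.example|other-spam\.example"
typo_pattern = r"known-spam\.example|other-spam\.example|"  # stray trailing "|"

print(wrongly_excluded(good_pattern, legitimate))  # nothing legitimate is caught
print(wrongly_excluded(typo_pattern, legitimate))  # empty alternative matches every host
```

A stray `|` at the end of a pattern is an easy copy-and-paste typo, and in this sketch it matches every referrer, so it is worth checking for before you save.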