Google Analytics is a free web analytics service that tracks and reports site traffic and provides analytical tools for search engine optimisation and marketing purposes. It's currently one of the most reliable and powerful tracking services, and if you want to learn how to make the most of it, this post on Harnessing the power of Google Analytics might be useful. Unfortunately, not long ago spammers came up with a shady new way to draw traffic to their sites. If you work in digital in any capacity, you might have already come across it: the so-called 'referrer spam', also known as referral spam, log spam or referrer bombing.
Traffic is usually considered one of the most valuable factors when evaluating a site's performance. The more valuable traffic a website receives, the more opportunities there are to encourage conversions and revenue, as well as to improve ranking. Many factors influence the quality of the traffic that comes to your site. However, in order to determine the value of your site traffic, the first thing you need to know is whether this traffic is real or not...
Fake traffic and what it represents
Fake traffic is generated by bots, whereas real traffic comes from human interaction. In the context of Google Analytics, fake traffic represents fake hits that are sent to your Google Analytics property. A spammer can easily fake 'hits' which appear as events, page views, screen views, keywords, transactions, and more - and they only need your Google Analytics property ID to do so.
Bots and what they do
A bot is software designed to perform repetitive tasks with high levels of accuracy and speed. In recent years, bots have become more common and even more difficult to detect. Regardless of the size of your site, there is the chance it can be visited by bots.
Depending on their purpose, bots are either legitimate or malicious. The former keep the web running smoothly, ensuring higher quality content is seen. An example is 'Googlebot', which crawls site pages in order to determine ranking in SERPs, ultimately assisting the growth and development of the site. The latter do exactly the opposite. They are non-human programs designed by spammers or hackers that generate fake ads to trick users into downloading malware, spread spam, and steal information. Legitimate bots can usually be kept away with directives in a robots.txt file. For those not familiar with the term, our friends from Moz have explained in detail all you need to know about robots.txt files. Sadly, spam bots can't be blocked this way, because they simply ignore robots.txt. They also tend to disguise themselves by pretending to be a web browser or traffic coming from a legitimate site so that your security system fails to recognise them.
Referrer spam and how it affects your Google Analytics account
A referrer is the URL of the page you came from, which your browser passes in the HTTP Referer header when it navigates from one page to another. This information is tracked by your analytics platform, telling you where your audience arrives from. Unfortunately, the referrer can easily be faked, and many spammers take advantage of that. They replace the actual referrer with the address of a site they want to promote, and send numerous requests to your site. Since referrals are tracked, the fake ones show up in your reports alongside the real ones. A good way to identify spam is by looking at the bounce rate, as spam referrals usually have a very high bounce rate.
Referrer spam can seriously mess up your Google Analytics data. How big its impact is going to be depends on the size of your site. Needless to say, the bigger the business and the site, the less noticeable the traffic from referrer spam is going to be, but the danger is that it can easily mislead marketing analysis by masking legitimate traffic reports. Not only that, but the repeated requests can cause a higher server load which can result in an increase in the bounce rate and decrease in SEO rankings. In extreme cases, spammers might be trying to find vulnerabilities and other ways to break your site's security.
Spammers do this very same thing to thousands of Google Analytics accounts, most commonly through a bot. They hope that you'll see the link in your Google Analytics dashboard, presume it's a valid site and click on it.
In Google Analytics there are three types of referrer spam bots:
1. Spambots that visit your website, known as Crawler Referrer Spam
2. Spambots that don't visit your website, known as Ghost Referrer Spam
3. The new type of spambots, known as Language Spam
But first, before you begin cleaning your data, you need to protect it from possible misconfigurations. The first thing to do is create the following three views:
- Master View – where you apply filters and the view to use for analysis.
- Unfiltered View – your backup view that has no filters or any settings that affect the incoming data. This ensures you'll always have the raw, unmodified data should things go wrong, as there is no 'undo' button when creating filters.
- Test View – where you first create your filters to experiment with them and test their functionality. It mirrors your master view. A good tactic is, once you create a filter, to wait a few days and compare the data against the unfiltered view so you can check whether the filter actually works.
Crawler Referrer Spam
As the name implies, this method involves using a web crawler and actually requires visiting your page. Automated crawlers hit your servers and execute the Google Analytics script, ignoring rules like those found in robots.txt. It is the less common type of spam and can be blocked via an .htaccess file. These bots send out HTTP requests to thousands of websites, using fake referrer headers to avoid being detected as bots. When exiting your site, they leave a record in your reports that looks similar to a legitimate visit. The fake referrer header contains the website URL that the spammer wants to promote and/or build backlinks for.
How to stop it
To detect and fix referrer spam, navigate to the 'Referrals' report in your Google Analytics view, change the date range to at least the last two months and sort by bounce rate in descending order. Referrals with 10 or more sessions and either a 100% or 0% bounce rate are most likely to be spam referrers. If you're still unsure, search for the name of the referrer, as there may be some information on it. If not, you'll have to take the risk, click on the link and visit the site. Then add all spam referrers whose traffic you want to block from your Google Analytics account to a list. This list has to be converted into a regex, typically the domains separated by the '|' character with each dot escaped, which is later used for creating the Google Analytics view filter.
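The two steps above (shortlisting suspicious referrers, then turning the list into a regex) can be sketched in Python. This is only an illustration: the rows are made-up data standing in for an export of the 'Referrals' report, and the function names are hypothetical. Every shortlisted domain should still be checked by hand before it goes into a filter.

```python
import re

# Hypothetical rows exported from the 'Referrals' report:
# (referrer, sessions, bounce rate in percent).
rows = [
    ("darodar.com", 48, 100.0),
    ("event-tracking.com", 23, 0.0),
    ("news.example.org", 310, 42.5),
    ("tiny.example.net", 3, 100.0),
]

def suspicious_referrers(rows, min_sessions=10):
    """Return referrers with enough sessions and a 0% or 100% bounce rate."""
    return [
        referrer
        for referrer, sessions, bounce_rate in rows
        if sessions >= min_sessions and bounce_rate in (0.0, 100.0)
    ]

def build_exclude_pattern(domains):
    """Join escaped domains into one alternation for the 'Filter pattern' box."""
    return "|".join(re.escape(domain) for domain in domains)

candidates = suspicious_referrers(rows)
pattern = build_exclude_pattern(candidates)
print(candidates)
print(pattern)
```

Escaping each domain matters because an unescaped dot matches any character, so a loosely written pattern could exclude legitimate referrers as well.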
Not sure what a regex is? Follow this link for an introduction to regular expressions.
Once you have identified spam referrers, you need to block them as soon as possible in order to stop them from visiting your website. Since each visit is recorded in your server log, you can block crawler bots through your .htaccess file.
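Because crawler spam does leave a trace in the server log, the log itself can be used to cross-check what you see in Google Analytics. The sketch below assumes the log is in the common 'combined' format, where the referrer and user agent are the last two quoted fields. The sample lines and the domain spam-crawler.example are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# In the combined log format a line ends with: "referer" "user-agent"
LOG_TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')

def referrer_hosts(log_lines):
    """Count requests per referrer hostname, skipping empty ('-') referrers."""
    counts = Counter()
    for line in log_lines:
        match = LOG_TAIL.search(line)
        if not match:
            continue
        host = urlparse(match.group("referer")).hostname
        if host:
            counts[host] += 1
    return counts

# Made-up sample lines; a real log would be read from a file.
sample_log = [
    '203.0.113.7 - - [12/Jan/2017:10:01:22 +0000] "GET / HTTP/1.1" 200 5120 "http://spam-crawler.example/" "Mozilla/5.0"',
    '203.0.113.7 - - [12/Jan/2017:10:01:23 +0000] "GET / HTTP/1.1" 200 5120 "http://spam-crawler.example/offer" "Mozilla/5.0"',
    '198.51.100.4 - - [12/Jan/2017:10:02:10 +0000] "GET /blog HTTP/1.1" 200 7301 "-" "Mozilla/5.0"',
]

counts = referrer_hosts(sample_log)
print(counts.most_common())
```

A referrer that appears in the log many times in quick succession, always requesting the same page, is a good candidate for the block list.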
Block via .htaccess
The key to stopping referrer spam is to block it before it has the chance to access your site and register as a referrer. The simplest way to do this is to add rewrite rules that reject requests carrying a spam referrer to the .htaccess file in the root directory of your domain. This method is better than just blocking the domain in Google Analytics because it turns spam bots away at the server, before they can load your pages.
Your .htaccess file needs to be updated constantly. If you don't want to create a new file for each of your sites, you can create an umbrella .htaccess file by storing one in the directory which contains all of your sites. If you would rather use a unique .htaccess file for each of your sites, keep all the spam domains in a single master file, update it regularly and copy and paste it into each one.
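Since the block list has to be refreshed so often, it can help to generate the rules from a plain list of domains rather than editing them by hand. The following is a minimal sketch, not the exact snippet this post originally showed: it assumes an Apache server with mod_rewrite enabled, the function name is hypothetical and the domains are placeholders. It emits one RewriteCond per domain, followed by a RewriteRule that answers matching requests with 403 Forbidden.

```python
import re

def htaccess_block_rules(spam_domains):
    """Render mod_rewrite rules that refuse requests from the given referrers."""
    if not spam_domains:
        # Without any condition the final rule would block every request.
        return ""
    lines = ["RewriteEngine On"]
    for index, domain in enumerate(spam_domains):
        is_last = index == len(spam_domains) - 1
        # NC = case-insensitive match, OR = chain with the next condition.
        flags = "[NC]" if is_last else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_REFERER}} {re.escape(domain)} {flags}")
    # '-' means no substitution, F returns 403 Forbidden.
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines)

# Placeholder domains; replace with the referrers you identified.
rules = htaccess_block_rules(["spam-crawler.example", "bad-links.example"])
print(rules)
```

The last condition deliberately has no OR flag. If it did, Apache would chain it to nothing, so the generator treats the final domain separately.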
If you're unfamiliar with .htaccess files, read this guide to find out what .htaccess files do and how to use them.
If you want to get creative you can even redirect the traffic back to where it came from by using a Deflector. You have to create a text file called deflector.map, which looks like this:
And then add the following code to your .htaccess file:
Through a custom advanced filter in Google Analytics
Only use this method when, for some reason, you can't access the .htaccess file. If you can stop the crawler spam bots from visiting your site in the first place, you don't need to exclude them later from your Google Analytics reports. Fighting spam at the server level is far more efficient, and you should try to minimise the use of view filters as much as possible as they can create data sampling issues.
To create a new custom EXCLUDE filter, navigate to the 'Admin' section of your test view and click on the 'Filters' link. Add a new filter and paste the regex you created earlier into the 'Filter pattern' text box. The filter should then exclude all of the traffic from the spam referrers listed in the regex. If the filter is working properly in the test view, apply it to your master view. Remember, as you discover new spammers you will have to add them to the filter, and this filter will exclude everything that matches, so be careful with your expressions!
Our Digital Manager takes care of our Google Analytics and makes sure no spam impacts our data. Every time a new spammer appears in our ‘Referrals’ reports, we add it to our customised exclude filter, as you can see in the image below.
Click here to download a copy of the complete version of our spammers list. You can then copy and paste the list to your Google Analytics account and quickly create your own custom filter. Don’t forget that you have to add any new spammer you discover.
Ghost Referrer Spam
The vast majority of spam is of this type. Its main characteristic is that it sends fake traffic without actually visiting your website, hence the name. It does that by sending raw fake hit data, known as 'ghost traffic', directly to Google Analytics' servers via the Measurement Protocol. All the spammers need is your Google Analytics property ID. Unless you are using Google Tag Manager, they can quite easily find your tracking ID (UA-XXXXX-1) by simply looking at the page source of your site. Spambots then scrape your ID and share it with other spambots.
The tricky part here is that since Ghost bots don't visit your website, their visit can't be recorded in your server log, meaning that blocking the URL in your .htaccess file is impossible. Examples of such bots are darodar.com and event-tracking.com.
How to stop it
Because ghost spam never actually visits your site, it usually leaves a fake or undefined hostname, the latter appearing as '(not set)' in your reports. Use this to create a filter that lets through only traffic with valid hostnames and excludes all ghost traffic automatically. In other words, if you include traffic from only the hostnames you recognise, you can significantly decrease the impact of ghost traffic.
You need to be able to identify all valid hostnames. Any website where you're using your Google Analytics property ID is always a valid hostname. They may include e-commerce shopping carts or telephone call tracking services linked from your website. Also, the hostname pointing to your own website is always a valid one as you want to keep all the traffic coming from it as well.
Real visits to your website from a referral link always have two server names available: the Source that the link is from and the Hostname that the landing page is directing you to (your server). In the majority of cases, the Hostname should be your server, regardless of where traffic came from. What Ghost Spam does however is send traffic to a random series of tracking ID numbers, using only blank (not set) or fake hostname values. It never knows your server name. What this means is that you can simply eliminate all of the ghost visits by filtering to INCLUDE only the valid hostname – your server!
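To make the difference concrete, here is a small sketch that reads the hostname out of two Measurement Protocol payloads. The parameter names (tid for the property ID, dh for the document hostname, dr for the referrer) come from the Universal Analytics Measurement Protocol; the payloads themselves and the function name are invented for illustration.

```python
from urllib.parse import parse_qs

def hit_hostname(payload):
    """Return the hostname ('dh') carried by a hit, or '(not set)' if absent."""
    fields = parse_qs(payload)
    return fields.get("dh", ["(not set)"])[0]

# A hit produced by a real page view carries the site's own hostname.
real_hit = (
    "v=1&tid=UA-XXXXX-1&cid=555&t=pageview"
    "&dh=www.example.com&dp=%2Fpricing&dr=https%3A%2F%2Fnews.example.org%2F"
)

# A ghost hit is assembled without ever loading the page, so 'dh' is
# either missing or a guess.
ghost_hit = (
    "v=1&tid=UA-XXXXX-1&cid=777&t=pageview"
    "&dp=%2F&dr=http%3A%2F%2Fdarodar.com%2F"
)

print(hit_hostname(real_hit))
print(hit_hostname(ghost_hit))
```

Every field in the payload is supplied by the sender, which is why the referrer can be anything the spammer likes, and why the hostname is the one value they typically get wrong.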
To create a filter that recognises only valid hostnames, without letting any fake traffic pass, you first need a regex containing all the hostnames whose traffic you want to include in your Google Analytics view. Start with a multi-year report showing just hostnames. Navigate to Audience reports, expand Technology and select Network (see the image below). To see a report with all your hostnames, fake and real, make sure you select Hostname at the top of the report, as Service Provider is selected by default. Make a list of all the relevant hostnames you find. You should see at least one, which will be your primary domain. Finally, convert this list into a regular expression.
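Converting the hostname list into a regular expression follows the same idea as the referrer list, except that this pattern is used to INCLUDE traffic. A minimal sketch, assuming two placeholder hostnames and a hypothetical helper name:

```python
import re

def build_hostname_pattern(hostnames):
    """Build an anchored alternation that matches only the listed hostnames."""
    escaped = "|".join(re.escape(name) for name in hostnames)
    return "^(" + escaped + ")$"

# Placeholder hostnames; use the valid ones from your own Hostname report.
valid_hostnames = ["www.example.com", "shop.example.com"]
hostname_pattern = build_hostname_pattern(valid_hostnames)

# Values as they might appear in the Hostname report.
for hostname in ["www.example.com", "(not set)", "darodar.com"]:
    keep = re.search(hostname_pattern, hostname) is not None
    print(hostname, "keep" if keep else "drop")
```

Anchoring the pattern with ^ and $ is a design choice here: it stops a fake hostname that merely contains your domain from slipping through. The trade-off is that every new subdomain or third-party service you add must be added to the list, otherwise its traffic will be dropped too.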
The same process from the example above applies here as well. First navigate to the 'Admin' section within the test view, click on the 'Filters' link and add a new filter. This time create a custom INCLUDE filter and copy and paste the regex you created earlier. Remember, it's important to choose include, rather than exclude as it was in the example above. When you're satisfied that the filter is working properly, you can duplicate the filter in your master view.
Language Spam
In late 2016, spam evolved to focus on inserting fake language values. It was first seen on Nov 8, 2016, the day of the US presidential election. You might have seen a line on your audience dashboard under 'language' stating:
"Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!".
It uses the language HTTP header (which is usually a short abbreviation, such as "en", "en-gb" or "en-us") to send messages and is combined with referral spam, with multiple domains listed as source/medium, ranging from legitimate sites such as Reddit or Twitter to not so legitimate ones such as abc.xyz, buketeg.xyz and biteg.xyz. It tries to draw users' attention to both the fake referrer domains and the language report.
Referral spam usually has a high percentage of new sessions, a very high bounce rate and, consequently, low pages per session. Language spam, however, may also affect your average time on site and pages per visit metrics. It only registers page views on your homepage, so metrics for internal pages should not be influenced. It was believed that this kind of spam would be short-lived due to its relation to the US elections. Unfortunately, it's only growing stronger.
How to stop it
Georgi Georgiev at Analytics Toolkit has put together a guide on how to get rid of Google Analytics Language Spam. Unfortunately, there is no way to permanently erase this type of traffic from your reports but Georgi offers two effective alternative solutions that can help you keep your reports free of spam for a good period of time.
Block via view-level filter
This is a fairly straightforward and logical process where the goal is to filter out any traffic where the value of the language dimension is 15 symbols or longer, as most legitimate language settings sent by browsers rarely go above 5-6 symbols. In addition, traffic from sources that contain invalid characters in the language field will be filtered out.
The regex Georgi has provided looks like this:
In order to create the filter, navigate to the 'Admin' section of your Google Analytics, select 'Filters' and click on the red button that says +Add Filter. Name the filter and change its type to 'Custom'. Apply settings as shown in the screenshot below. Make sure to filter the 'Language Settings' dimension.
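The two rules behind this filter (an overly long language value, or characters that don't belong in a language code) can be tried out before you touch your views. The pattern below is an illustrative sketch of those rules and is not necessarily the exact expression Georgi published. In particular, which characters count as invalid is an assumption made here.

```python
import re

# Rule 1: 15 or more characters. Rule 2: punctuation typical of URLs and
# messages (this character set is an assumption for illustration).
LANGUAGE_SPAM = re.compile(r".{15,}|[.,!/:]")

def is_language_spam(language):
    """Return True if the language value looks like a spam message."""
    return LANGUAGE_SPAM.search(language) is not None

samples = [
    "en",
    "en-gb",
    "en-us",
    "(not set)",
    "vote.example",
    "Secret.ɢoogle.com You are invited! Enter only with this ticket URL. "
    "Copy it. Vote for Trump!",
]

for value in samples:
    print(repr(value), is_language_spam(value))
```

Testing against your own language report first is worthwhile, because an exclude filter built on too broad a pattern silently drops real visitors.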
Filter out via advanced segment
Filters cannot be backdated, so they won't help with your historical data. If you want to clean up past data when preparing reports, Georgi recommends using custom segments. To do that you need to go to your 'Demographics Overview' section and click 'Add new segment'. Within the language dimension add the regex from above, and be careful to choose 'does not match regex'. Once saved you can then apply it to most reports in Google Analytics.
Exclude filters do help get rid of the spam traffic that pollutes your reports, but you might not want to rely on them too heavily. Every time you request a specific report, Google Analytics goes back to the unfiltered data, even if you've previously applied a custom segment. All of the raw data you think you're blocking out actually sticks around and can cause sampling issues. That's where Google Tag Manager appears to be the better solution, as it helps you minimise the use of filters in your analytics account.
Turn on Google's bots & spiders option
Google Analytics has the ability to exclude easy-to-identify bots & spiders. It's important to know that you need to enable this option for every view you have created. To do that navigate to the 'Admin' section and for each view you use select 'View settings', and check the box to 'Exclude all hits from known bots and spiders'.
Google Analytics is a very powerful tool that can help you understand your traffic and grow your business, but it's essential to ensure that the data you receive is accurate and trustworthy. An important part of getting reliable data is ensuring that it is clean and represents your real users. Once you've cleared your account of traffic spam, we would suggest moving to a tag management platform such as Google Tag Manager. This makes it harder for bots to scrape your ID from your page source and helps keep your analytics data protected.
If you're not familiar with Google Tag Manager or need help applying any of the solutions mentioned above, get in touch with us and have a chat. We'd love to help.