University of Bristol | ILRT | IntDev blog


This is a blog from the Internet Development Team at ILRT, Bristol. We build websites and web applications for a wide variety of customers, many in the UK higher education sector.

That most secretive of animals, your website audience

In our second article, Kieren Pitts tackles one of the thorniest topics regarding website usage: how many people use my website?

An easy question with a tricky answer

A common question asked by web authors and content managers is “How many people are accessing my content?”. In the world of traditional publishing, this is a good measure of the success of a publication. Unlike the publishing world, where it is relatively easy to know how many copies of a publication have been read (or sold), in the online world it’s not as easy.

A little bit of technical stuff…

The internet is a large collection of interconnected computers forming a huge network. For these computers to communicate with each other (for example, to send email) each computer must be identifiable. To do this, computers are assigned an Internet Protocol (IP) address. IP addresses take the form of four numbers separated by dots, for example 192.168.1.100, with each number ranging from 0 to 255.
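As a small sketch of this structure, Python's standard ipaddress module can break an address into its four dot-separated numbers (the example address is the one used above):

```python
# A minimal sketch using Python's standard ipaddress module to show that
# an IPv4 address is four numbers (octets), each between 0 and 255.
import ipaddress

addr = ipaddress.IPv4Address("192.168.1.100")
octets = list(addr.packed)  # four bytes, one per dot-separated number

print(octets)                               # [192, 168, 1, 100]
print(all(0 <= n <= 255 for n in octets))   # True
```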

The request in simple terms

When users follow a link to a web page their web browser sends a request for that page to a web server (the computer storing the web page). The server then sends the page to the user’s browser (the browser sends subsequent requests for images etc. within the page). Web servers log these requests and record (amongst other things) the file requested, the IP address of the computer requesting the file and the date and time of the request.
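To make the logging step concrete, here is a hedged sketch of parsing one entry in Apache's Common Log Format, which records exactly the details mentioned above: the requesting IP address, the date and time, and the file requested. The sample log line is invented for illustration.

```python
# Parse a single (invented) Apache Common Log Format line, extracting
# the IP address, timestamp and requested file.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) \S+'
)

line = '192.168.1.100 - - [20/Feb/2009:14:08:00 +0000] "GET /index.html HTTP/1.1" 200 5120'
m = LOG_PATTERN.match(line)

print(m.group("ip"))    # 192.168.1.100
print(m.group("path"))  # /index.html
print(m.group("time"))  # 20/Feb/2009:14:08:00 +0000
```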

Server log analysis

Many web usage analysis tools (often just called web stats tools), such as AWStats (the tool we use on our clients’ websites) and Analog, work by reading through these log files of requests. A busy website might receive millions of requests per month, and log analysis tools summarise them into a report. Reports might contain the number of pages requested, the number of unique users visiting your website and even the duration of their visits.
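In essence, such a tool boils the logs down to counts. A minimal sketch of the idea, with invented log lines (real tools parse full log entries and report far more detail):

```python
# Sketch of what a log-analysis tool does: read log lines, then count
# total requests, distinct IP addresses and requests per page.
from collections import Counter

lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.1 GET /about.html",
    "10.0.0.2 GET /index.html",
]
ips = [line.split()[0] for line in lines]
pages = Counter(line.split()[2] for line in lines)

print(len(lines))            # 3 requests in total
print(len(set(ips)))         # 2 "unique users" (really: 2 IP addresses)
print(pages["/index.html"])  # 2 requests for the home page
```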

A screenshot of the AWStats interface, a commonly used web log analysis tool

Unfortunately, these figures are almost totally meaningless because of assumptions used to calculate them or simply the architecture of the web.

So why can’t I tell how many people visit my site?

If a computer has an IP address and is used by one user then why can’t we say that each IP address appearing in the web server logs is equivalent to one user? Well, that’s often how web log analysis tools estimate the number of users visiting your site.

Unfortunately you can’t assume a single IP address represents a single user because the internet is a bit more complicated than this. For example, a request from a user might go through a proxy server before reaching your web server. Proxy servers request files on behalf of the user’s computer and pass the files back to the user’s computer. This means the proxy server’s IP address appears in the web server logs and not the IP address of the user’s computer.

Why use proxy servers? Commonly they’re used to create a web cache and reduce an organisation’s bandwidth usage (the amount of data being sent back and forth across the internet), which the organisation may have to pay for. If someone requests a file that another person using the same proxy server has already requested, they will receive a cached copy from the proxy server. This also means users get pages more quickly, as the pages are not being fetched from external servers.

In large organisations every computer might access the web via a proxy server. Consequently, hundreds of people might access your website but only one IP address (that of the proxy server) will be recorded in the logs. Web log analysis tools will report all these people as a single user.
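The effect on IP-based counting can be sketched directly. In this illustration (the user names, pages and addresses are all invented), three people behind one proxy produce three requests, but the server log contains only the proxy's address:

```python
# Sketch of the proxy problem: three different users behind one proxy
# all appear in the web server's log as the proxy's single IP address,
# so counting distinct IPs reports one "user".
requests_by_user = {
    "alice": "/index.html",
    "bob": "/courses.html",
    "carol": "/contact.html",
}
PROXY_IP = "194.66.0.10"  # the only address the web server ever sees

logged_ips = [PROXY_IP for _ in requests_by_user]

print(len(requests_by_user))  # 3 real users
print(len(set(logged_ips)))   # 1 distinct IP address in the log
```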

Doesn’t this mean that some people won’t reach my website at all?

Yes, you’re absolutely right. If users are receiving cached copies of your pages then the requests won’t reach your web server. As the requests won’t appear in the logs they won’t be counted when doing log analysis.

So, the number of users reported by log analysis is an underestimate?

You could be forgiven for thinking so, but I’m afraid it’s not that simple. Companies that deal with a lot of internet traffic (such as Internet Service Providers like AOL) run many proxy servers, so a single user might request five pages with each request passing through a different proxy server (i.e. five different IP addresses). That one user will then appear as five users in the report from your web log analysis tool.

Further problems are caused by web robots, also called crawlers or bots. These computer programs visit web pages, retrieve content and follow links in the retrieved document to find new pages. Web robots are used by search engines (like Google and Yahoo) to find and catalogue web content. If your log analysis does not exclude these robots then they will be reported as legitimate users. Many log analysis tools exclude robots but it’s not always possible to spot and exclude all the different ones.
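Robot exclusion is usually done by matching request details, such as the user-agent string, against a list of known robots. A hedged sketch of the idea follows; the entries and the (deliberately tiny) robot list are invented, and as the article notes, no real list catches every robot:

```python
# Sketch of robot exclusion: drop log entries whose user-agent string
# matches a known-robot substring. Real tools maintain much longer,
# regularly updated lists, and unknown robots still slip through.
KNOWN_ROBOTS = ("googlebot", "slurp", "bingbot")

entries = [
    ("10.0.0.1", "Mozilla/5.0 (Windows NT 5.1)"),
    ("66.249.66.1", "Mozilla/5.0 (compatible; Googlebot/2.1)"),
    ("10.0.0.2", "Mozilla/5.0 (Macintosh)"),
]

humans = [
    (ip, ua) for ip, ua in entries
    if not any(bot in ua.lower() for bot in KNOWN_ROBOTS)
]

print(len(humans))  # 2 entries survive the filter
```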

A further consideration is regular users who visit your site from several locations. A user might access your site from work and then again from home. They will almost certainly do so from two different IP addresses and be reported as two users by log analysis.

Do I have to analyse my web logs to estimate usage?

One popular alternative to log analysis is Google Analytics. Google Analytics relies on a hidden piece of tracking code embedded in each web page on your site. This code is a small computer program written in a language called JavaScript; when a modern web browser loads a page containing JavaScript, it runs the program.

In the case of Google Analytics, the browser runs the code, which sends information to Google. This allows Google to identify the page the user is on, the type of browser and other information. Google uses cookies (small text files stored by the browser) to track users as they move around the website. Data sent by the tracking code almost always comes from a real person, since most web robots cannot run JavaScript, so their actions go unrecorded.
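The cookie mechanism can be sketched in a few lines. This is a simplified illustration of cookie-based visitor tracking in general, not Google's actual code: the first beacon from a browser is given a new visitor ID, and later beacons carry the cookie back so page views can be linked to one visitor.

```python
# Simplified sketch (not Google's actual implementation) of cookie-based
# tracking: assign a visitor ID on first contact, reuse it thereafter.
import uuid

cookie_jar = {}  # stands in for a browser's cookie store

def track(browser, page):
    """Record one page view, setting a visitor-ID cookie if none exists."""
    visitor_id = cookie_jar.get(browser)
    if visitor_id is None:
        visitor_id = str(uuid.uuid4())   # "set" a new cookie
        cookie_jar[browser] = visitor_id
    return (visitor_id, page)

hits = [track("firefox-at-home", p) for p in ("/", "/about", "/contact")]

print(len(hits))                               # 3 page views...
print(len({vid for vid, _ in hits}))           # ...from 1 visitor
```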

A screenshot of the Google Analytics interface. Google Analytics relies on JavaScript tracking code rather than analysing log files.

Are tools like Google Analytics 100% accurate then?

Unfortunately not: the tracking code is written in JavaScript and not every user has a browser that runs it. Text-only browsers and some assistive tools used by people with disabilities do not support JavaScript. In addition, some institutions (such as financial institutions and schools) strip JavaScript from web pages before they reach the user. In these cases the users can’t be tracked.

In our experience, and depending on the nature of your website, you may see 25–75% fewer users reported by tools like Google Analytics compared with web log analysis tools.

Do I need to know the number of people visiting my site?

This is a difficult question to answer. Since you can’t determine the figure reliably, examining it over the short term is often meaningless.

However, assuming you record the figures in the same way over a long period (several years), the lack of accuracy becomes less important. If the figures are inaccurate in the same way each month, then changes between them become interesting. For example, comparing data from March 2008 with data from March 2009 provides more meaningful information. Do this for long enough and trends will emerge.
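The like-for-like comparison amounts to a relative change between matching periods. A small sketch, with invented visit counts: if the measurement bias is roughly constant from month to month, the percentage change is far more informative than either raw number.

```python
# Sketch of comparing like-for-like periods: compute the relative change
# between the same month in consecutive years. The counts are invented.
visits = {"2008-03": 12400, "2009-03": 15500}

change = (visits["2009-03"] - visits["2008-03"]) / visits["2008-03"]
print(f"{change:.0%}")  # 25%
```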

A further consideration is whether you can measure users in other ways. For example, you might choose to record the transactions (such as course bookings) made through your website each month. Over a long period these figures, which must be directly attributable to actual users, give an indication of usage trends.

Remember, web statistics are largely meaningless. It’s important to recognise the shortcomings and determine what’s best for your organisation. If your statistics suggest numbers of users are increasing but numbers of transactions are falling, then working out why this is happening is more important than increasing traffic.

In future postings we’ll look at determining the geographical location of users, dwell time (how long users spend on your site) and how to get the most from Google Analytics.

Useful links

Understanding web log statistics and media metrics

How reliable is Google Analytics

and arguably the most famous essay on the subject:

Why web usage statistics are (worse than) meaningless

This entry was posted on 20th February 2009 at 2:08 pm and is filed under Briefings. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

1 comment

  1. Good article.

    One comment though. “Unlike the publishing world where it is relatively easy to know how many copies of a publication have been read…” – I’m not sure I would agree with this. For example, a single publication could be passed around and read by numerous people – think of all those periodicals in libraries and waiting rooms…

    So some of the caveats for web stats aren’t a million miles from print stats really.
