The Thinker

Robots and Spiders and User Agents – Oh my!

Six weeks or so ago I complained about SiteMeter. I noticed big discrepancies between the number of visits and page views it was reporting versus what my Apache web server log was showing. Those discrepancies remain. However, as I dig into the details of my web server logs, I am beginning to realize that it is impossible to know how many people are looking at my site. And the same is true for anyone who owns a web site.

Apache does an excellent job of recording every hit to my site, as you would expect. The problem is separating the wheat from the chaff. Ironically, my investigation suggests that much of the difficulty in getting an accurate measure of a blog’s audience is due to the news syndication technology that I have praised.

For example, this site is hit twice an hour by bloglines.com. Bloglines.com is a news aggregator. It sniffs through blogs and apparently caches content (i.e. stores a copy of my pages) on its servers. If a bloglines.com reader has subscribed to my blog, bloglines.com presents its local copy of my latest blog entries. I have no way of knowing how often a bloglines.com subscriber is reading that cached copy. However, I have some idea of how many bloglines.com subscribers are looking at my blog, because bloglines.com is nice enough to report the total number of subscribers in the user agent field. For my index.rdf newsfeed, for example, it reports 11 subscribers. I have one subscriber to my atom.xml newsfeed. How often each bloglines.com subscriber looks at my newsfeed, though, is unknown. Perhaps bloglines.com will someday share its statistics with us content providers, but I did not see any information on its website suggesting it plans to do anything like that.
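Out of curiosity, I threw together a little script to pull those subscriber counts out of my log. Here is a rough sketch in Python, assuming an Apache combined-format log named access_log and assuming the Bloglines user agent keeps reporting subscribers in the form "... N subscribers"; adjust the file name and the regular expression to whatever actually appears in your own log.

    # Rough sketch: pull Bloglines subscriber counts out of an Apache
    # combined-format access log. The log file name and the exact wording of
    # the Bloglines user agent string ("... N subscribers") are assumptions;
    # adjust both to match what actually shows up in your own log.
    import re

    LOG_FILE = "access_log"   # assumed path to the Apache combined log

    # combined format: ... "request" status bytes "referer" "user-agent"
    line_re = re.compile(r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')
    subs_re = re.compile(r'Bloglines.*?(\d+) subscriber', re.IGNORECASE)

    counts = {}               # feed path -> highest subscriber count seen
    with open(LOG_FILE) as log:
        for line in log:
            m = line_re.search(line)
            if not m:
                continue
            s = subs_re.search(m.group("agent"))
            if s:
                path = m.group("path")
                counts[path] = max(counts.get(path, 0), int(s.group(1)))

    for path, subscribers in sorted(counts.items()):
        print(path, subscribers, "subscribers")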

Bloglines.com is just one example of a news aggregator service. There are dozens of others. It is difficult to determine which are genuine aggregators and which are vanilla robots trolling my site. Arguably, they are the same thing. A robot is simply a program sniffing through my site for content and perhaps storing some of it in a search engine. A news aggregator agent does the same thing, but presents the content to people specifically interested in viewing my content remotely.

When a robot hits my site, clearly no human is reading the content. At some future time, a human may follow a link on the robot’s web site that takes them to my site. Alternatively, like Google, they may read a local cache of my web page instead of hitting my site for the real thing. There is no way to know for sure, but most likely I am missing statistics for many page views.

There is also a lot of “noise” in my web server logs. I did a number of Google searches on some of these user agents and discovered that many of them have malicious intent. They are looking for various vulnerabilities on my site, presumably so they can inject viruses or Trojan horses. My own hits to my site are in the log too, along with hits from the program I wrote to examine the log. From my perspective those hits are noise to be filtered out. There are also many requests for pages that do not exist. All of these need to be filtered out as well.
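For the record, my noise filtering amounts to something like the following sketch. The IP address and the suspect agent strings are placeholders; substitute your own address and whatever junk actually turns up in your own log.

    # Rough sketch of the noise filter described above. MY_ADDRESSES and
    # SUSPECT_AGENTS are placeholders: substitute your own IP address and
    # whatever junk agents turn up in your own log.
    import re

    MY_ADDRESSES = {"192.168.1.10"}                    # hypothetical: my own PC
    SUSPECT_AGENTS = ("libwww-perl", "Indy Library")   # examples only

    # Apache combined log: host ident user [time] "request" status bytes "referer" "agent"
    line_re = re.compile(
        r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
        r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
    )

    def keep(line):
        """Return True if the hit might be a genuine page view."""
        m = line_re.match(line)
        if not m:
            return False                        # malformed line
        if m.group("host") in MY_ADDRESSES:     # my own visits and my log analyzer
            return False
        if m.group("status") == "404":          # requests for pages that do not exist
            return False
        agent = m.group("agent")
        if any(bad in agent for bad in SUSPECT_AGENTS):
            return False
        return True

    with open("access_log") as log:
        kept = [line for line in log if keep(line)]
    print(len(kept), "hits survive the filter")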

In addition, new robot programs are unleashed all the time. Yahoo, for example, has two going. The normal one is Yahoo Slurp, but there is also a Yahoo commerce robot called Yahoo Seeker. Yahoo also offers a news aggregator service. Unlike bloglines.com, its user agent string does not identify the number of subscribers. So once again, a webmaster like me can only shrug his shoulders.

There are many robots out there apparently. At some point, I am going to have to fine-tune a robots.txt file. This file is supposed to tell search engines the content that may be indexed on your site. Reputable robots obey the robots.txt file. Shady robots — and there are apparently plenty — do not bother and will index your content anyhow. To keep out the shady robots you have to fine-tune an .htaccess file to block them. This is generally a manual process, and thus avoided by many webmasters.
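For what it is worth, here is the sort of thing I have in mind. These are only illustrative fragments: "BadBot" and "EvilCrawler" are made-up names standing in for whatever misbehaving robots show up in your log, and the .htaccess lines assume an Apache server with mod_setenvif available.

    # robots.txt -- polite robots read this before indexing
    User-agent: *
    Disallow: /cgi-bin/

    # a robot you want to exclude entirely (made-up name)
    User-agent: BadBot
    Disallow: /

    # .htaccess -- for robots that ignore robots.txt (needs mod_setenvif)
    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    SetEnvIfNoCase User-Agent "EvilCrawler" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot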

Overall, I think it is fair to draw a couple of conclusions. First, “metadata” is growing. Metadata is information about a website, and robot programs on remote computers stream across the Internet to your web site to collect it. Any kid in his basement can write his own robot program and start trolling your site for content (and suck up your bandwidth for no compensation). Robots are equal opportunity programs. As metadata grows, it becomes more important for a webmaster to be able to manage its collection. At some point, I expect there will need to be mechanisms more sophisticated than Apache .htaccess files. Perhaps web hosts will offer sophisticated application-level filtering, allowing in a known list of reputable robots. Alternatively, perhaps they will let in only those robots that share their metadata. For a webmaster like me, this seems eminently fair. How many subscribers are trolling my site via your site? Who are they and how often do they come? If you are going to sift through my content, the least you can do is accurately identify yourself and tell me how many people are reading my site and how often. I can see the Internet Engineering Task Force or the World Wide Web Consortium creating some standards for exchanging metadata statistics in the near future.

Second conclusion: the browser, if not dying, is not exactly well. Looking through my web server logs, I see all sorts of applications reading my site, and not all of them appear to be robots. For example, there is the Microsoft URL Control. I Googled it and discovered that virtually any Microsoft program that uses the Internet uses this control. Therefore, someone clicking on a link embedded in a Word document is probably using this control. There are also Java programs talking to my web site; I see Java/1.5.0 and Java 1.3.1 as user agents. Whether they are more robot programs or application programs serving my content is impossible for me to figure out.
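If you want to see what is hitting your own site, a quick tally of the user agent field is revealing. Here is another rough Python sketch against the same assumed access_log; it simply counts the last quoted field on each line of a combined-format log.

    # Rough sketch: tally the user agent strings in the (assumed) access_log.
    # In a combined-format log the user agent is the last quoted field on
    # each line, which is all this relies on.
    import re
    from collections import Counter

    agent_re = re.compile(r'"([^"]*)"\s*$')   # last quoted field = user agent

    tally = Counter()
    with open("access_log") as log:
        for line in log:
            m = agent_re.search(line.rstrip())
            if m:
                tally[m.group(1)] += 1

    for agent, hits in tally.most_common(20):
        print(hits, agent)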

The bottom line is that it is becoming impossible to know how many human beings are reading your site. This is not good. For larger sites, this makes it difficult to provide meaningful metrics to advertisers who might want to pay to place ads on your site. It also leaves the webmaster scratching his head: is my site popular or not?

Perhaps it is close enough to take a SiteMeter reading and multiply it by some number to estimate the total readership. On the other hand, perhaps a webmaster should take the page views in his Apache web log and multiply them by some fraction. Whatever. It is a gross estimate at best. Perhaps the better metric is simply a site’s Google or Bloglines.com ranking. Counting page views and calculating visits is increasingly irrelevant.


Update 8/31/05 – I noticed one peculiarity: my own PC was hitting my index.rdf (newsfeed) file once an hour. This was very puzzling because I certainly wasn’t requesting that page. Apparently my browser, Mozilla Firefox, is doing it for me. As long as my browser is active, it checks the index.rdf file once an hour. This can produce a lot of false positives: you may assume someone is reading your newsfeed when the “bot” is your own browser.

If it bothers you, don’t create a live bookmark to your own site. And look with suspicion on entries in your web server log from a Firefox browser that arrive once an hour. On the positive side, such entries do mean that someone took the time to add your newsfeed as a live bookmark, which suggests they think your site is worth checking regularly.
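If you are curious whether a given visitor is a live bookmark (or some other polling aggregator) rather than a person, the hourly rhythm is the giveaway. Here is one more sketch along the same lines, again assuming a combined-format access_log; it flags hosts that fetch index.rdf at roughly hourly intervals.

    # Rough sketch: flag hosts that fetch index.rdf at roughly hourly
    # intervals, the pattern a Firefox live bookmark (or any polling
    # aggregator) leaves behind. Same assumed access_log as above.
    import re
    from datetime import datetime
    from collections import defaultdict

    line_re = re.compile(r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "GET (?P<path>\S+)')

    hits = defaultdict(list)          # host -> timestamps of index.rdf fetches
    with open("access_log") as log:
        for line in log:
            m = line_re.match(line)
            if m and m.group("path").endswith("index.rdf"):
                # Apache time looks like 31/Aug/2005:14:05:07 -0400; drop the zone
                stamp = datetime.strptime(m.group("time").split()[0], "%d/%b/%Y:%H:%M:%S")
                hits[m.group("host")].append(stamp)

    for host, stamps in sorted(hits.items()):
        stamps.sort()
        gaps = [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
        # call it a poller if every gap is within ten minutes of an hour
        if gaps and all(50 <= g <= 70 for g in gaps):
            print(host, "polls index.rdf roughly hourly:", len(stamps), "hits")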

Curiouser and curiouser, as Alice would say…
