When I first started hosting my own webpages, I noticed many requests for a file called robots.txt in the log files kept by IIS. Here’s an example (though this one was generated on a Linux machine running the Apache web server):
255.255.255.255 – – [02/May/2004:05:34:56 -0700] “GET /robots.txt HTTP/1.0” 404 – “-” “Googlebot/2.1 (+http://www.googlebot.com/bot.html)”
So I paid a visit to the link and learned where Google and other search engines get their results: automated programs that scour the net for webpages. The link also explains how to set up a robots.txt file to control how Googlebot indexes your site.
Because I’m running on a low-speed connection and using one machine for all my tasks, I can’t afford too much incoming traffic, especially when I’m gaming. So I set up a robots.txt file telling all bots/spiders/crawlers not to index anything on any of my websites.
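The disallow-everything robots.txt is just two lines, placed at the root of the site. `User-agent: *` matches every compliant crawler, and `Disallow: /` tells them to stay away from the whole site:

```
# Block all well-behaved crawlers from the entire site
User-agent: *
Disallow: /
```

Note this only works for bots that honor the Robots Exclusion Protocol; a misbehaving crawler can simply ignore the file.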
The reasons: they suck up my bandwidth, and they bring in visitors searching for content I don’t have. All that unnecessary traffic eats up my precious CPU cycles.
One of my websites, the IRC Quotes site, contains the name of a Japanese song with an .mp3 extension. Google indexed it, and over a long period of time I got about 20–30 visits from people whose Google search string was that song name plus the file extension. I wasted their time and they wasted my CPU cycles.
Lastly, here’s a link to the site that inspired this entry. It also has interesting information on bandwidth and data transfer at web hosting providers, and on why you might want to create a robots.txt file to manage the crawlers: stargeek – Bandwidth and Data Transfer – Which is which?