Please, Don't Google My Webpage

DON'T Google my web page? Who would want that? If you have a website or a blog you WANT all the search engines to find your pages and index them. You want to be found, so that others can find you. Right? There are companies that specialize in helping you get to the top of search results.

But there are situations when you don't want your web page to be found. Maybe it's a login page, or it contains somewhat sensitive material (though it can't be very sensitive, since you're posting it on the Net!), or it has images you don't want people finding in a search. Or maybe it's a page that can only be viewed by a registered user or after logging in somewhere else, so there's no point in having it indexed - anyone who clicks the Google link just hits a login wall (surely this has happened to you at least once with a NY Times or Chronicle of Higher Education story link).

Realize that if your page isn't indexed by the search engines, it's highly unlikely that someone will just stumble upon it.

A good example is this blog. This entry will get a few dozen hits in the next week from our subscribers or regular visitors. But if I check the stats for this piece in six months, they will be much higher. (In fact, I just looked back 6 months in this blog and the entry from 9/26/06 has 1,258 hits now.) How does that happen? Well, that entry was indexed by Google and all the rest of the sites that send out their (ro)bots. So when someone searched on "podcast" or any of the other words in that piece (or in my tags), they found it.

So how do you stop those auto-searching, hungry little bots that Google and others send out to find new Net stuff? Google (I'm just going to say "Google" to mean all search engines, OK?) has a set of computers that continually crawl the web. They keep track of the sites they've already found, re-read all the pages on those sites, and look for new pages and for changes to the old ones. This collective of computers is known in Googleland as "Googlebot."

But webmasters have a defense: you can put a file called robots.txt on your server, a standard plain-text document that tells the bots not to download some (or all) of the material on your site. It has to live at the top level of the site (e.g., www.example.com/robots.txt) for the bots to find it.

I don't want to get too technical here (that's Tim's job), but I'll give you the basics and add a few links to sites where you can get a lot more information.

This robots.txt file spells out restrictions for search engine robots, and the well-behaved ones are at least courteous enough to check for it before they access the pages of a site.
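For example, a minimal robots.txt that tells every bot to keep out of the entire site looks like this (the comment line starting with # is optional):

# tell all bots: don't crawl anything on this site
User-agent: *
Disallow: /

The * means "every bot," and Disallow: / covers everything from the top of the site on down. If you leave the Disallow value empty, it means the opposite: crawl whatever you like.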

Even simpler is using a robots META tag on a page. That lets the HTML author indicate to a bot that the page shouldn't be indexed, or used to harvest more links.

This method doesn't work with all bots, but it will stop most of the main ones. (Are you starting to envision all these bots as some kind of invading organism, like in the 1960s movie Fantastic Voyage, or that stuff that attacked the ship in The Matrix?)

So that META tag would look like this:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

A robot that honors the tag should neither index the document nor analyze it for links.
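To be concrete, the tag belongs in the page's <head> section, like so (the page title here is just a placeholder):

<html>
<head>
<title>Members Only</title>
<!-- well-behaved bots: don't index this page, don't follow its links -->
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
<body>
...
</body>
</html>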

There is actually a format standard called the Robot Exclusion Standard.

If your page or site has been online a while, it has probably already been found. Adding or changing your robots.txt file won't be reflected in search results immediately, but it will be discovered and obeyed the next time a bot crawls your site.

Google actually uses several user-agents, so you can block access to any or all of them. If you block "Googlebot" you stop the bot that gathers pages for their web & news index; if you block "Googlebot-Mobile" you stop the one that crawls pages for their mobile index (maybe your pages aren't mobile-ready, so you want them ignored); and you can block "Googlebot-Image" to stop your images from being found.
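For example, a robots.txt entry like this would keep Googlebot-Image away from your whole site (so your pictures stay out of Google's image search) while the other Google bots carry on as usual:

User-agent: Googlebot-Image
Disallow: /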

There are finer-grained tunings too. Look at this one - what do you think it does?

User-Agent: Googlebot
Disallow: /documents/
Allow: /documents/disclaimer.html

That would block all pages inside the folder/directory called "documents" except for that one disclaimer.html page.
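One caveat: Allow isn't part of the original Robot Exclusion Standard - it's an extension that Google and some of the other big crawlers understand, so a lesser-known bot may ignore it and simply honor the Disallow.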

My point is this - sometimes you just don't want to be found. Know how to hide. The bots are always out there...

Links to find out more tech info on all this:

http://www.robotstxt.org/

google.com/webmasters/ - lots of stuff for web folks about how to interact with Google.

And here is a good & simple two-part blog entry from the Google folks about this topic:

googleblog.blogspot.com/2007/01/controlling-how-search-engines-access.html

googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html
