Please, Don't Google My Webpage

DON'T Google my web page? Who would want that? If you have a website or a blog you WANT all the search engines to find your pages and index them. You want to be found, so that others can find you. Right? There are companies that specialize in helping you get to the top of search results.

But there are situations when you don't want a web page to be found. Maybe it's a login page, or it contains somewhat sensitive material (though it can't be very sensitive if you're posting it on the Net!), or it has images that you don't want people finding in a search. Maybe it's a page that can only be viewed by a registered user or after logging in at another page, so you don't want it indexed, even if someone who clicks the Google link wouldn't be able to see it without logging in. Surely this has happened to you at least once when you clicked some NY Times or Chronicle of Higher Education story link.

Realize that if your page isn't indexed by the search engines, it's highly unlikely that someone will just stumble upon it.

A good example is this blog. This entry will get a few dozen hits in the next week from our subscribers or regular visitors. But if I check the stats for this piece in six months, they will be much higher. (In fact, I just looked back 6 months in this blog and the entry from 9/26/06 has 1,258 hits now.) How does that happen? Well, that entry was indexed by Google and all the rest of the sites that send out their (ro)bots. So when someone searched on "podcast" or any of the other words in that piece (or in my tags), they found it.

So how do you stop those auto-searching, hungry little bots that Google and others send out to find new Net stuff? Google (I'm just going to say that to mean all search engines, OK?) has a set of computers that continually crawl the web. They know which sites have already been found and they read all the pages on each of those sites and they search for new ones and changes that have been made to the old ones. This collective of computers is known in Googleland as "Googlebot."

But webmasters can also put a file called robots.txt on the server: a standard text file that tells the bots not to download some (or all) of the content from your web server.

I don't want to get too technical here. (That's Tim's job.) But I'll give you the basics and add a few links to sites where you can get a lot more information.

This robots.txt file provides restrictions for search engine robots. Those automated bots are at least courteous enough to check for a robots.txt file before they access the pages of a site.
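To give you a feel for it, a minimal robots.txt, which must sit at the top level of the site (e.g. yoursite.com/robots.txt), might look like this; the /private/ folder here is just an illustration:

```
User-agent: *
Disallow: /private/
```

The * line says the rules apply to every bot, and each Disallow line names a path the courteous bots will skip. A single "Disallow: /" would ask them to skip the whole site.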

Even simpler is using a robots META tag on a page. That lets the HTML author indicate to a bot that the page shouldn't be indexed, or used to harvest more links.

This method doesn't work with all bots, but it will stop most of the main ones. (Are you starting to envision all these bots as some kind of invading organism, like in the 1960s movie Fantastic Voyage or that stuff that attacked the ship in The Matrix?)

So that META tag, placed in the page's <HEAD> section, would look like this:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Then the robot should neither index this document nor analyze it for links.

There is actually a format standard called the Robot Exclusion Standard.

If your page/site has been online a while, it has probably already been found. Adding or changing your robots.txt file won't be immediately reflected in search results, but the changes will be discovered and used the next time a bot crawls your site.

Google actually uses several user-agents, so you can block access to any or all of them. If you block "Googlebot" you stop the bot that gathers pages for their web & news index; if you block "Googlebot-Mobile" you stop the one that crawls pages for their mobile index (maybe your pages aren't mobile-ready, so you want them ignored); and if you block "Googlebot-Image" you stop your images from being found.
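To sketch what that looks like, a robots.txt that keeps just Google's image crawler away might contain something like this (the /photos/ path is purely an illustration):

```
User-agent: Googlebot-Image
Disallow: /photos/
```

The regular Googlebot would still crawl those pages; only the image crawler is asked to stay out of that folder.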

There are fine tunings too. Look at this one - what do you think it does?

User-Agent: Googlebot
Disallow: /documents/
Allow: /documents/disclaimer.html

That would block all pages inside the folder/directory called "documents" except for that one disclaimer.html page.
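If you'd like to test how a set of rules behaves before putting it on your server, Python's standard library includes a robots.txt parser. Here's a small sketch using rules like the ones above. (One caveat: Allow is an extension that Google honors, and Python's urllib.robotparser applies rules in the order listed, first match wins, so the Allow line comes first in this sketch.)

```python
from urllib.robotparser import RobotFileParser

# Rules like the example above. The Allow line is listed first
# because urllib.robotparser checks rules in order (first match
# wins), unlike Google, which uses the most specific matching rule.
rules = """\
User-agent: Googlebot
Allow: /documents/disclaimer.html
Disallow: /documents/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/documents/report.html"))      # blocked
print(rp.can_fetch("Googlebot", "/documents/disclaimer.html"))  # allowed
```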

My point is this - sometimes you just don't want to be found. Know how to hide. The bots are always out there...

Links for more technical info on all this:

http://www.robotstxt.org/

google.com/webmasters/ - lots of stuff for web folks about how to interact with Google.

And here is a good & simple two-part blog entry from the Google folks about this topic:

googleblog.blogspot.com/2007/01/controlling-how-search-engines-access.html

googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html

IdeaStorm and User Forums


Dell recently launched a customer relations site called IdeaStorm where registered users can submit product and feature requests, gripes about Dell, etc. You can also vote on those posts if you think they are noteworthy. I know what you're saying: "You mean they ripped off Digg.com?"

Hold on.

The day before IdeaStorm arrived, I read a piece on Techcrunch about a similar Yahoo! site which you can access at http://suggestions.yahoo.com.

Now, Dell did give a hat tip to Digg saying that their IdeaStorm is “a combination of message board and Digg.com.”

Do a blog search and you find a lot of people upset about these new sites (a lot of dedicated Digg users especially). I'm more aligned with Michael Arrington on Techcrunch and some others who feel it may be a good thing.

How many sites out there are doing their own versions of a YouTube post-your-video site? Meneame & Hugg are two other Digg clones. So why pick on Dell & Yahoo?

I think what bothered folks about these two latest sites is that they are Big Companies and they want to protect the daughter & son startups (somehow "mom & pop stores" seems wrong for any of these web 2.0 sites) from the big sharks up there in the food chain.

Seems to me that Dell, Yahoo and others are seeing what works on other social sites and using it on their own sites - and maybe to their own purposes. I like the idea of being able to go to a corporate site and post my thoughts (positive & negative) about products. I like being able to get responses from other users. Oftentimes, their responses are faster and more useful than waiting for the official tech support.

Of course, the company is no doubt hoping to get unsolicited feedback for R&D on products and, conspiracy theorists that most of us have become, we suspect that The Man is somehow censoring, altering, and manipulating these social sites. Personally, I wouldn't want to be the employee assigned to monitor one of these sites if that is the corporate intent, but...

I know that in our months of getting ready to launch NJIT on iTunes U (and still today), the most useful place to turn for information was the Apple Support Discussion Forums. There was someone from Apple monitoring and responding at times (Hurray for Duncan!) but most of the discussion is from other folks in the trenches.

All this is not new. I've been sifting through places like the Adobe Product Forums for years, posting questions about Dreamweaver, looking for information on Flash. If the company is getting some good from it too, all the better. Let's hope it leads to positive changes in the products.


Spring Cleaning


"Serendipity will get you through times of no Internet better than Internet will get you through times of no Serendipity" --Perfesser Pedagogue

Heeding those words of wisdom: while Serendipity35's code sorcerers update the magic spells that guide us through the cyberspace cloud (and may also knock us offline intermittently over the next 24 hours), we direct you to the current list of spammers and general e-mail bad guys for your reading pleasure.

If your sock drawer is already arranged, if you've already washed and waxed the cat, and if you know that even Spring Training baseball on TV is still a week away: there is no better time to update your spam filters to keep your e-mail Spring Cleaning right on track.

You might notice hotmail.com (still) and yahoo.com (newly) on the banned e-mail spammers list. Hotmail.com has failed for years to prevent spammers from using its hostname; yahoo.com has had recent problems with spammers flooding the market with real and faked yahoo.com e-mail relays. It is likely that yahoo.com will solve its current spam troubles far sooner than hotmail.com ever will.

And it is probable that the code warriors at Serendipity35 will have the blog back up and running by the time any of us get this whole spam block thing figured out. It just might take a little (more) serendipity.

ThinkFree


ComputerWorld did a comparison of online office suites. With Microsoft's major changes to the new Office 2007, there is probably a new group of current Office users who might consider alternatives, especially if they are free.

What is appealing about online office suites?

  • documents stored online are available anywhere you can access the Internet
  • the suites work on multiple operating systems
  • there's no software to buy, download or install
  • upgrades/fixes are done automatically and professionally by the company
  • you can share documents without emailing them or needing shared network access

They tested four popular online office suites and looked specifically at the word processor and spreadsheet components:

  1. Ajax13
  2. Google Docs & Spreadsheets
  3. ThinkFree Online
  4. Zoho Office Suite

And the winner was ThinkFree, which includes Write (word processor), Calc (spreadsheet), and Show (presentation software). ThinkFree is compatible with Microsoft Office and with Windows, Macintosh, Unix and Linux systems.

Their review includes details. For example, Write offers two modes of operation: a Quick Edit minimal-interface mode for simple editing that is more WordPad than Word, and a Power Edit mode that is more robust, with menus, a toolbar, a ruler bar, a drawing toolbar, AutoShapes, text boxes, clip art, and pictures. You can also insert images from clip art, from files, or from Flickr. There's also the ability to save files as PDFs, and ThinkFree can generate HTML web pages.

Can Educators Really Ban Wikipedia?


I saw the headline in various forms around the Net: Middlebury College Bans Wikipedia as Academic Source. I've been seeing versions of that from K-12 school districts and university departments for a few years.

The open-source, free encyclopedia that lets anyone create and edit entries has gotten attention & criticism for a few highly publicized (overly publicized, in my mind) incidents where there was incorrect information online.

So, because of errors that may occur in this dynamic online source, Middlebury's history department instituted a "ban": a policy that says, "Wikipedia is not an acceptable citation, even though it may lead one to a citable source."

An Op-Ed piece on the Middlebury student weekly website "Wikipedia ban is a slippery slope" says, "To me, this stinks of the beginnings of censorship."

Still, the eye-catching headline really should say, as a commenter to the Op Ed wrote, that the ban is really "nothing more than a limit upon citation. You're still allowed to consult Wikipedia, just like always. What you can't do is cite it in your paper..."

When I was teaching middle school students, the librarians always told students not to use encyclopedias as sources, and they encouraged teachers who brought their classes to the library for research to ban encyclopedias as acceptable sources for an assignment.

They should have just thrown away the encyclopedias and stopped ordering the new editions, because it was still the first place the kids went for an assignment.

I started to require students to use the encyclopedia first as a way to get an overview before they narrowed their broad topics to the ones I wanted them to use - which were narrow enough that there was little in the encyclopedia that would help them.

This followed right along with my practice of starting off a literature unit on a book like To Kill A Mockingbird by passing around copies of the Cliff Notes, Monarch Notes & all the other things I thought they might buy to avoid reading the book. I would use the essay questions in those books in class and make sure my students knew that I was not going to use them as assignments.

To think that my students wouldn't use the encyclopedias or the Cliff Notes because I "banned" them would have been foolish.

In an article from The New York Times, Jimmy Wales, the co-founder of Wikipedia, commented on the Middlebury policy in a way I have heard him comment many times before this:

“I don’t consider it as a negative thing at all.”

He continued: “Basically, they are recommending exactly what we suggested — students shouldn’t be citing encyclopedias. I would hope they wouldn’t be citing Encyclopedia Britannica, either."

“If they had put out a statement not to read Wikipedia at all, I would be laughing. They might as well say don’t listen to rock ’n’ roll either.”

The Wikimedia Foundation supports the new policy. According to the Burlington Free Press, in an e-mail to the newspaper the Foundation said Wikipedia is an "ideal place to start your research and get a global picture of a topic; however, it is not an authoritative source."



Weekend Twofer

There are so many interesting photos on Flickr at this point. And it has its own search capabilities (by tags, including geotags (map places), by user, even by the type of camera used).

Still, I find this Flickr Grab site interesting to use. It grabs the photos marked as most interesting by users today. You can click a photo and jump over to it on Flickr, or check the LIVE box on the site and just preview a series of photos on the site itself. You can put in whatever keyword you want to search, and it filters your search through the "interesting" rating on photos. The result is a cleaner version of the search: user-reviewed results, actually.

Widgipedia is aiming to be "the ultimate resource for both users and developers of widgets and gadgets." They catalog Web & desktop widgets. If you are a creator you can upload your software, collaborate, use tutorials, even grab code samples.

What's a widget? According to Wikipedia:

A web widget is a portable chunk of code that can be installed and executed within any separate HTML-based web page by an end user without requiring additional compilation. They are akin to plugins or extensions in desktop applications. Other terms used to describe a Web Widget include Gadget, Badge, Module, Capsule, Snippet, Mini and Flake. Web Widgets often but not always use Adobe Flash or JavaScript.