Making sense of online textual information and information management technologies
   
 
Blogs and Google: The Future of Categorized Indexes
May 10, 2003

If you track news and events in the world of information management and information retrieval, you must be glutted with news about Google. Over the last two or three months, there has been a deluge of news and opinion about Google across the media. But in this deluge of adulation and praise, some points of view stand tall and make their mark. One such article, "Google to fix blog noise problem" by Andrew Orlowski in The Register, is causing quite a stir.

The present information glut about Google (including the pontifications about its future and the sycophantic worship that follows them) seems natural enough, given that a company like Google attracts people at several levels: intellectually stimulating research, marketing and competitive strategy, and even the pure thrill of predictive gossip.

In this euphoric noise about Google, the article in The Register (known for its "scoops" in the online media world) stands out. The article, which starts out as a report on a recent announcement by Google, goes on to offer a very interesting yet equally opinionated analysis of the blog noise in Google's search results. It claims that the very idea of PageRank is subverted by the inclusion of blogs in the main Google search index, because blogs thrive on linking to each other (quite self-referentially evident here!).

The Register article lays out the following threads of argument:

  • Going by its past experience with Deja, Google is very likely to remove blogs from the main index
  • Bloggers would welcome such a move
  • This move will act as a much-needed purge of the main Google index, which at the moment is "polluted" by the inter-linking noise created by blogs, noise that effectively subverts the PageRank system
  • Blogs contain a huge amount of meta-data (information about information rather than the actual information itself), and hence they lower the information quality of the main Google index
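The inter-linking argument can be made concrete with a small sketch. What follows is the textbook power-iteration form of PageRank, not Google's actual ranking, and the six-page link graph, the page names and the damping factor of 0.85 are all assumptions made up for illustration. It compares a graph where three blogs only cite an article with the same graph after the blogs also start linking to each other.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


BLOGS = ["blog_1", "blog_2", "blog_3"]

# Hypothetical graph: two sites and three blogs all cite one article.
plain_links = {
    "site_a": ["article"],
    "site_b": ["article"],
    "article": [],
    "blog_1": ["article"],
    "blog_2": ["article"],
    "blog_3": ["article"],
}

# Same graph, except every blog now also links to the other two blogs.
interlinked_links = dict(plain_links)
for blog in BLOGS:
    interlinked_links[blog] = ["article"] + [b for b in BLOGS if b != blog]


def blog_share(rank):
    """Total rank held by the blog pages (ranks sum to 1 overall)."""
    return sum(rank[blog] for blog in BLOGS)


plain = pagerank(plain_links)
interlinked = pagerank(interlinked_links)
print("blog share without inter-linking:", round(blog_share(plain), 3))
print("blog share with inter-linking:   ", round(blog_share(interlinked), 3))
```

In this toy graph the blogs' combined share of rank goes up once they link to each other, and the article's share goes down, although nothing about the article itself has changed. That is the effect the Register article calls "pollution"; how strongly it shows up in a real index is a separate question this sketch cannot answer.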

Now, to counterbalance this point of view, let us look at another article on the same subject, this time by Dr Elwyn Jenkins ("Google Blog Search: A Google News Model") in Microdoc News. I think it is very sympathetic (to say the very least) to both blogs and Google, despite its attempt to provide a well-rounded and balanced point of view. Here is a summary of the threads of argument covered in the article:

  • There is a state of "anxiety" created by Orlowski's article and possibly by Google's announcement.
  • A Google News model will be much more efficient than taking blogs out of the main Google index completely: as with news, blog posts would transition into the mainstream search index after 30 days.
  • De-indexing blogs might lead to other complications, such as throwing out other sites that are built with blogging tools.
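The 30-day transition is easy to picture as a routing rule. The sketch below is only an illustration of that idea: the function name, the category labels ("blog", "news", "web") and the choice to move a document on exactly the 30th day are assumptions, not anything Google or Microdoc News has described.

```python
from datetime import date, timedelta

# Assumed incubation period, taken from the "30 days" in the proposal above.
INCUBATION = timedelta(days=30)


def target_index(kind, published, today):
    """Return the name of the index a document would be served from.

    kind is a hypothetical label: "blog", "news" or "web".
    """
    if kind in ("blog", "news") and today - published < INCUBATION:
        # Fresh posts and news items stay in their own category index.
        return kind
    # Everything else, including older posts, lives in the main index.
    return "main"


today = date(2003, 5, 10)
print(target_index("blog", date(2003, 5, 1), today))
print(target_index("blog", date(2003, 4, 1), today))
print(target_index("web", date(2003, 5, 9), today))
```

Note that such a rule only works if the search engine can tell what is a blog in the first place, which is exactly the complication raised in the last point above.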

Both of these articles are very informative and add to our already "bulging" knowledge about Google, but while they make a number of relevant points, they miss several others that define the very core of information retrieval and information use. Let us see what those points are:

  • Anybody who searches through Google (or, for that matter, any of the other major search engines) cannot help but notice that blogs ARE adding extensive noise that affects the precision and recall of the search results. Note that this is not just true of Google: try searching on Alltheweb or MSN and you will be able to sense the noise.

  • A Google News-like approach is interesting, but it is important to understand that most blogs (the inter-linking, PageRank-subverting type) feed on news, so blog posts mostly contain references to news, to articles, or to other "information sources". Besides, blogs are very close to user groups because of their meta-level, dialogic nature.

  • When users search the Internet, they are looking for a variety of stuff: news, articles, audio, video, games, images, "web sites", etc. Search engines have already started catering to various categories of searches, so it would make sense in the long run to have separate indexes for each of these sources. It will be useful for users to choose directly from "websites", news, discussion forums and weblogs rather than wade through one common index for everything.

  • Of course, it will be very difficult to determine the appropriate unit for de-indexing. For instance, what will constitute a "website": a home page or an index page of a site?

  • The benefits of de-indexing audio and video from "websites" are self-evident; SingingFish is a case in point.

  • On the other hand, when you are searching in the main Google index, you don't want to see your results tilted towards, say, news or blogs, and this is exactly the problem with the 30-day incubation period of the Google News model.

  • At a more philosophical level, there is still the issue of how one could differentiate between full-fledged articles written by bloggers and opinionated news articles culled from various editorial-based online publications.

  • It could be much more interesting (from a user perspective) to see how search engines cross-link between different categories of information sources. For instance, it will be much more efficient to be able to go to blogs through a news article and vice versa (Technorati-style).
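Two of the points above, separate indexes per content type and cross-links between categories, can be sketched together as a toy data structure. Everything here is hypothetical: the class name, the category labels, the whitespace tokenizer and the sample documents are invented for illustration, and no real search engine is claimed to work this way.

```python
from collections import defaultdict


class CategorizedIndex:
    """Toy search index that keeps one inverted index per content category."""

    def __init__(self):
        # category -> term -> set of document ids
        self._postings = defaultdict(lambda: defaultdict(set))
        # document id -> category
        self._category = {}
        # document id -> ids it links to / ids that link to it
        self._links_out = defaultdict(set)
        self._links_in = defaultdict(set)

    def add(self, doc_id, category, text, links=()):
        """Index a document under one category and record its outgoing links."""
        self._category[doc_id] = category
        for term in text.lower().split():
            self._postings[category][term].add(doc_id)
        for target in links:
            self._links_out[doc_id].add(target)
            self._links_in[target].add(doc_id)

    def search(self, term, category):
        """Return ids of documents in one category that contain the term."""
        by_term = self._postings.get(category, {})
        return sorted(by_term.get(term.lower(), set()))

    def related(self, doc_id, category):
        """Return documents in `category` that link to or from `doc_id`."""
        neighbours = self._links_out.get(doc_id, set()) | self._links_in.get(doc_id, set())
        return sorted(d for d in neighbours if self._category.get(d) == category)


index = CategorizedIndex()
index.add("news-1", "news", "Google announces blog search plans")
index.add("blog-1", "blogs", "My take on the Google announcement", links=["news-1"])
index.add("blog-2", "blogs", "Weekend photos")
index.add("site-1", "websites", "Google corporate information")

print(index.search("google", "blogs"))    # search only the blog index
print(index.related("news-1", "blogs"))   # hop from a news item to blogs citing it
print(index.related("blog-1", "news"))    # and back from a blog post to its source
```

The search call never mixes categories, so a query against "websites" is not tilted by blog posts, while the related call gives the news-to-blog hop (and back) that the last point describes. The hard part, deciding which category a page belongs to, is deliberately left out of this sketch.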

To conclude, it would be correct to argue that categorized search (categorized on the basis of themes, content types, or any other required category), intertwined with automatic and semantically relevant inter-linkages, will help end users search the Internet in a much more meaningful manner than they do today.