October 14, 2003
Now that the news about Google's acquisition of Kaltix has died down, and even Google's CEO has spoken about the company's search vision, it is time to write an in-depth analysis (in true K-Praxis fashion). The Kaltix acquisition (along with technology from OutRide, a contextual and personalized search startup Google bought last year) might give a fillip to a number of web information retrieval ideas; this article is an attempt to understand such implications.
Google Searches: Present Issues
The discussion of problems and issues with Google PageRank is not new; we analyzed Google searches and the blog phenomenon some time back on K-Praxis (Blogs and Google: The Future of Categorized Indexes). Let us bring back some of the important points for discussion:
- Anybody who searches through Google, or for that matter through any other major search engine, can't help but notice that blogs ARE adding extensive noise to the search results, affecting both precision and recall. This is not just true of Google: try searching on Alltheweb or MSN and you will be able to sense the noise. So in a way blogs both subvert and validate the very essence of the PageRank system, which is ranking the relative authority of web pages.
- A Google News-like approach for blogs, where you do a separate search for blogs but blog posts are incorporated into the main index after a certain incubation period, is interesting. But it is important to understand that most blogs (the inter-linking, PageRank-subverting type) feed on news, so blog posts mostly contain references to news, to articles, or to other "information sources". Besides, blogs are very close to user groups because of their meta-level dialogic nature.
- On the other hand, when you are searching in the main Google index you don't want to see your results tilted towards, say, news or blogs, and this is exactly the problem with the 30-day incubation period for Google News.
- When a user searches on the Internet, he is looking for a variety of stuff: news, articles, audio, video, games, images, "web sites", etc. Search engines have already started catering to various categories of searches, so it would make sense in the long run to have separate indexes for each of these sources. It will be useful for the users to choose directly from "websites", news, discussion forums and weblogs rather than wade through one common index for everything.
- Of course, it will be very difficult to determine the appropriate unit for de-indexing. For instance, what will constitute a "website": a home page or an index page of a site?
- The benefits of de-indexing audio and video from "websites" are self-evident; SingingFish is a case in point.
- At a more philosophical level, there is still the issue of how one could differentiate between full-fledged articles written by bloggers and opinionated news articles culled from various editorial-based online publications.
- It could be much more interesting (from a user perspective) to see how search engines cross-link between different categories of information sources. For instance, it will be much more efficient to be able to go to blogs from a news article and vice versa (Technorati-style).
- Given this backdrop, it would be quite reasonable to conclude that categorized and personalized search (categorized on the basis of themes or content types, or personalized to the context of the user), intertwined with automatic and semantically relevant inter-linkages, will help end users search the Internet in a much more meaningful manner than they do today.
Google PageRank: The Question of Relative Value of Retrieved Pages
In information retrieval terms, the PageRank system (to a great extent) assumes that the more linked a web page is, the greater its value. Whatever algorithm Google uses to normalize this effect, bringing in other aspects such as keywords, relatedness of the content and so forth, the basic system is still PageRank, so the results Google produces tilt towards a theory where the more "networked" you are, the more popular and trustworthy you are. At the beginning of web information retrieval systems, PageRank served a very important purpose: it gave us one basis on which to judge web pages and retrieve them.
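To make the "more linked means more valuable" assumption concrete, here is a minimal sketch of the textbook PageRank iteration on a toy link graph. It is an illustration only, not Google's production algorithm: the function name, the damping factor of 0.85, the iteration count and the toy page names are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    All names and parameter values here are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small uniform share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "hub", nothing links to "c".
toy_web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the heavily linked "hub" page ends up with the highest score and the unlinked page "c" with the lowest, whatever their content says, which is exactly the tilt towards "networked" pages described above.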
But now one is not just interested in the "authority" of a page as "judged" by Google PageRank, but in the relative authority of the page in a more constrained environment, where either the user's personal choices or a particular field of knowledge, domain or subject category is involved. So in effect the user is interested in judging the page in a variety of ways. He/she does not always want Google to calculate the "relative authority" of this page across the whole web; in a way that is redundant or of no importance to the user. What he/she is interested in is ranking this page within a constrained search: "if I am interested in baseball, then rank my results only in baseball terms", and so on.
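One well-known way to express such a constrained ranking is to bias the "random jump" in the PageRank iteration towards a chosen set of pages, so that authority is measured relative to that context rather than the whole web. The sketch below illustrates that general idea under stated assumptions; it is not a description of what Google, OutRide or Kaltix actually do, and the function name, parameters and toy pages are hypothetical.

```python
def contextual_pagerank(links, context_pages, damping=0.85, iterations=50):
    """PageRank whose random jump lands only on pages in `context_pages`.

    `links` maps each page to the list of pages it links to.
    Illustrative sketch only; names and values are assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    context = [p for p in pages if p in context_pages]
    if not context:
        raise ValueError("context_pages must contain at least one known page")

    # Jump probability is spread over the context set and is zero elsewhere.
    jump = {p: (1.0 / len(context) if p in context_pages else 0.0) for p in pages}
    rank = dict(jump)

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * jump[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without outgoing links return their rank to the context.
                share = damping * rank[page] / len(context)
                for p in context:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web with a baseball cluster and a cooking cluster.
toy_web = {
    "baseball-news": ["baseball-stats", "portal"],
    "baseball-stats": ["baseball-news"],
    "cooking-blog": ["portal", "recipes"],
    "recipes": ["cooking-blog"],
    "portal": ["baseball-news", "cooking-blog"],
}
baseball = contextual_pagerank(toy_web, {"baseball-news", "baseball-stats"})
cooking = contextual_pagerank(toy_web, {"cooking-blog", "recipes"})
print(round(baseball["baseball-stats"], 3), round(baseball["recipes"], 3))
print(round(cooking["baseball-stats"], 3), round(cooking["recipes"], 3))
```

The same link graph yields two different orderings: the baseball pages score higher for the baseball context and the cooking pages for the cooking context, which is the "rank my results in baseball terms" behaviour in miniature.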
Google + OutRide + Kaltix: Categorized and Personalized PageRank
So having understood the PageRank system vis-a-vis mapping the authority of web pages, we can now look at what these two startup technologies can bring to Google:
OutRide: OutRide, acquired by Google in 2002, brings in its unique ability to personalize and contextualize searches. So a user's information requirements and information usage (provided that the user consents to give out such information, or that this task is managed without taking any information away from the user!) will decide the relative ranking of results.
Kaltix: On the other hand, Kaltix brings the capability of, in effect, putting PageRank on steroids. Used with OutRide's personalization and contextualization technology, Kaltix technology can not only help Google speed up PageRank but also help it enter a completely new dimension of web information retrieval.
The Future of Google Searches: Implications for Internet Information Retrieval
The new dimension of these combined efforts could be that the meaning and the philosophy of PageRank itself get overhauled. Personalized and contextual computation of PageRank could mean that you are able (at any given time) to compute three very important aspects of a web page:
a) its total rank across the web,
b) its content, through content analysis,
c) and its relative rank within a specific context.
Sounds very interesting!
