July 14, 2004
One of the primary concerns of search engines today, besides understanding users' intent, is ranking information. A search engine application on the web or inside an enterprise needs to match information with the search query. The sheer volume of available information makes it essential that search results are ranked against the user's query. Following up on the first article in this series (The Emerging Face of Information Search Part 1: Understanding Users' Intention), this article looks at the world of information ranking, mainly by analyzing the ranking methods deployed by the most popular search engines.
Information Relevance Ranking
We saw in the last article that search engines deploy only simple ways of understanding user queries, yet the user comes back feeling great because her expectations are partly limited by what is being offered to her or by her knowledge of how search works. Also, the volume of information on the Internet and inside enterprise networks is so huge that you are bound to get some results back.
For their part, search engines like Google and Teoma (Ask Jeeves) have made significant improvements in information retrieval technologies and processes, attempting to make search a more useful and meaningful experience for the user. But as we will see later in this article, it seems that they are just scratching the surface of what's possible.
Here are some of the methods/technologies used by online search engines to rank information against a query:
Google: Uses more than 100 methods of ranking information, including its trademark PageRank system. This famous and sometimes infamous system tries to be democratic by assessing the popularity of a webpage based on how many other webpages link to it.
In terms of information ranking, the PageRank system, to quote from one of our earlier articles on K-Praxis ("Contextualized Tabbed OR Categorized Indexes and the Future of Search"), assumes to a great extent that the more linked a web page is, the greater its value. Whatever algorithm Google uses to normalize this effect by bringing in other aspects such as keywords, relatedness of the content and so forth, the basic system is still PageRank, so the results Google produces tilt towards a theory in which the more "networked" you are, the more popular and trustworthy you are.
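The link-popularity idea described above can be illustrated with a small sketch. This is the textbook power-iteration form of PageRank, not Google's production system; the function name, the damping value of 0.85, the iteration count and the tiny four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch only).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages "a", "b" and "d" all link to "c".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_ranks = pagerank(example_links)
most_linked = max(example_ranks, key=example_ranks.get)  # "c"
```

In this toy graph the page with the most inbound links ends up with the highest score, which is exactly the "more networked, more trusted" tilt discussed above.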
Teoma/AskJeeves: Teoma uses, among other well-known techniques, technology based on the Subject-Specific Popularity method to rank web pages. In this method a document is ranked higher because of its affinity to well-recognized expert documents on a related subject or topic. Despite attempts by Google to compete with Teoma with its Hilltop algorithm, Teoma still works wonders for many search terms. Maybe the search engine optimization (SEO) community has not yet attacked Teoma/AskJeeves because Google, the brand, is so powerful for them; or maybe it is almost impossible for the SEO community to spam these results with their techniques.
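A very rough sketch of the subject-specific idea follows. It is not Teoma's actual algorithm, whose details are not described here; it simply assumes that "affinity to expert documents" can be approximated by counting how many recognized expert pages on a topic link to a candidate page. The function name, the data layout and the sample page names are hypothetical.

```python
def subject_specific_popularity(inbound_links, expert_pages):
    """Score each page by the number of distinct expert pages linking to it.

    inbound_links maps a candidate page to the pages that link to it.
    expert_pages is the set of pages assumed to be experts on the topic.
    """
    experts = set(expert_pages)
    return {
        page: len(set(sources) & experts)
        for page, sources in inbound_links.items()
    }


# Hypothetical data: "page2" has more links overall, but fewer from experts.
inbound = {
    "page1": ["expertA", "expertB", "blog1"],
    "page2": ["blog1", "blog2", "blog3", "expertA"],
}
scores = subject_specific_popularity(inbound, {"expertA", "expertB"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

The point of the sketch is the contrast with raw link counting: a page with fewer links overall can still rank first if more of its links come from pages recognized as experts on the subject.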
Vivisimo: Vivisimo re-groups, re-organizes and ranks results based on its clustering technology, which allows on-the-fly clustering of results from other engines such as MSN, Lycos, Looksmart, Wisenut, Open Directory and Overture. This is an interesting way to rank information, allowing users to discover themes and concepts in the pages they are looking for.
Yahoo/MSN: Use Inktomi/Yahoo crawling and search technology that organizes results based on various factors such as link and domain popularity and keyword analysis. Although not much information is available, Yahoo's ranking algorithms possibly use technologies put together from Inktomi, AllTheWeb and Altavista.
Besides the ranking algorithms enumerated above, search engines also use the following signals to rank information:
1. Text in the Title
2. Keyword Frequency and Density
3. Keyword Positioning
4. Information in the meta tags
5. Content Analysis
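To make the first three signals in the list above concrete, here is a minimal sketch that combines a title match, keyword density and keyword position into a single score. The weights, the function names and the way the signals are added together are assumptions chosen for illustration; no search engine is known to use this exact formula.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_document(query, title, body,
                   title_weight=3.0, density_weight=1.0, position_weight=1.0):
    """Combine title match, keyword density and keyword position (sketch)."""
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    title_tokens = set(tokenize(title))
    body_tokens = tokenize(body)

    # Signal 1: fraction of query terms that appear in the title.
    title_score = sum(1 for term in terms if term in title_tokens) / len(terms)

    # Signal 2: keyword density, the share of body tokens that are query terms.
    matches = [i for i, token in enumerate(body_tokens) if token in terms]
    density = len(matches) / len(body_tokens) if body_tokens else 0.0

    # Signal 3: keyword positioning, higher when the first match comes early.
    position = 1.0 - matches[0] / len(body_tokens) if matches else 0.0

    return (title_weight * title_score
            + density_weight * density
            + position_weight * position)


# Hypothetical documents scored against the query "ranking".
documents = {
    "doc_a": ("Search engine ranking", "Ranking methods used by search engines"),
    "doc_b": ("Cooking tips", "How to bake bread at home"),
    "doc_c": ("Notes", "Some thoughts, eventually about ranking"),
}
scores = {name: score_document("ranking", title, body)
          for name, (title, body) in documents.items()}
ordered = sorted(scores, key=scores.get, reverse=True)
```

Signals this simple are easy to manipulate by repeating keywords, which is one reason engines combine them with link-based measures rather than relying on them alone.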
Information Relevance Ranking and Enterprise Search
Enterprise search engines face slightly different problems and hence have to follow different strategies. Enterprise-level documents are mostly longer and denser than web content and do not have the luxury of using any form of link popularity, but they are not as unorganized and unstructured as web pages. Major enterprise search vendors (aka Unstructured Data Management players) like Autonomy, Verity, Inxight, Google Search Appliance, Fast Search & Transfer, etc., use various methodologies, including information clustering, classification and categorization, to rank search results.
Information Relevance Ranking: Issues
Despite all the efforts made by both web search engines and enterprise search companies, a lot more work needs to be done before search technologies perfect the art of ordering information against the search terms. For online search engines this task is much tougher: not only are they engaged in providing quality organic/original results, they are also engaged in commercial activities, and it is (and will be) difficult for them to keep the two distinct while preserving the relevance of organic/original results. Besides, they also face huge problems from spammers and search engine optimizers who are ready to do anything to get a better ranking in the search results.
As the commercial buzz around search reaches its crescendo, the relevance of ranking could become the major point in this battle, because the commercial aspect of search results, especially for web search, is directly linked to information ranking.
