
   
Emerging Face of Information Search Part 5: Information Crawling and Indexing
July 26, 2004

Information Crawling and Indexing: Introduction

So far this series has focused on how several facets of search affect the user of the search results. To complete our understanding of information search, we need to look at another crucial facet: information crawling and indexing. The ability of search engines (online as well as enterprise) to ferret out information from all the nooks and crannies of the Internet and the enterprise network is the very nervous system of a search engine. It is what allows search engines to gather and harvest the information that is served up in the search results. Against this backdrop, and given the race towards bigger indexes started by the online search engines, the future of our search experience will depend on how well search engines can innovate in information crawling and indexing. This article continues the Emerging Face of Information Search series by asking some pertinent questions about crawling technologies and processes.

Online Information Indexes: Do Bigger Indexes Always Mean Better Indexes?


Online information indexes have grown by leaps and bounds, and the search media has often covered the race between search engines to grow their indexes. But nobody seems to be asking the right question: is this growth commensurate with the growth of online information? An ongoing survey called "How Much Information" at the University of California, Berkeley estimated that in 2003 the World Wide Web contained about 170 terabytes of information on its surface. According to the survey, this is tantamount to seventeen times the size of the Library of Congress print collections.

It is a simple fact that indexes have grown and that search engines have improved their ability to crawl the web through massive and effective use of hardware and processing power (e.g. Google's Linux clusters). Even so, search engines appear to be lagging behind the growth of online information, which is growing at a much faster pace than their ability to crawl and index it. So one could argue that the so-called "race" is not between different search engines but between search engine capabilities and the amount of information that is out there.


Here are a few factors that challenge the assumption that bigger indexes are better indexes:

    1. Often the top pages of a site are crawled but its inner pages are missing from the search indexes, and search engines frequently seem not to keep track of what is indexed and what is not. Indexes can be so volatile that pages keep appearing and disappearing.

    2. Although most search engines now index the major file formats, their penetration into these formats is still limited.

    3. The biggest issue among the ones listed here is search spamming by search engine marketers and search engine optimizers (SEOs). The recent search engine competition is a pertinent example: the number of pages grew from 0 to 500k in just two months. Just imagine how many of the pages indexed by search engines could be similar spam from webmasters trying to secure a higher ranking position in the search results.

    4. There are minor issues such as duplicate pages and pages from one site appearing many times over for a single search query. Search engines like Google seem to have had good success in tackling this issue, but the problem still remains.
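
The duplicate-page problem in item 4 can be illustrated with a minimal sketch. It assumes that two pages count as duplicates when their text is identical after lowercasing and collapsing whitespace. This is not how any particular search engine works, and the names fingerprint and drop_duplicates, as well as the example URLs, are hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Return a SHA-256 hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep only the first URL seen for each distinct fingerprint.

    pages maps URL -> page text (hypothetical, already-crawled content).
    """
    seen = set()
    unique = {}
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique[url] = text
    return unique


# Hypothetical crawl result: the second URL serves the same text as the first.
pages = {
    "http://example.com/a": "Search engines crawl the web.",
    "http://example.com/a?ref=home": "Search  engines crawl\nthe web.",
    "http://example.com/b": "Indexes keep growing.",
}
unique_pages = drop_duplicates(pages)
```

Exact hashing of this kind only catches pages whose text is the same apart from case and whitespace. Near-duplicates that differ in a banner or a date would need a similarity measure rather than a single hash.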

So it seems that the quality of an index has little to do with the numbers being flashed around by search engines. Online search engines will have to start looking seriously at the quality of their indexes rather than just bulking them up and boasting about the numbers for marketing purposes.

Innovations in Information Crawling and Indexing

A number of innovations are likely to take place (or are at least being talked about) in the near future that will affect how search engines crawl and index information. One significant idea doing the rounds is focused crawling and focused indexing, in other words subject-specific crawling. As pointed out elsewhere on K-Praxis, the biggest problem with focused or subject-specific crawling could be that these systems will have to depend on statistical, language-neutral technologies to make them work. Since these technologies have a rather infamous history of not working, this field needs much more commercial and real-world work rather than just academic research.
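
To make the idea of focused crawling concrete, here is a minimal sketch over a small in-memory "web" instead of the real network. It assumes a deliberately naive relevance measure, the fraction of a page's words that appear in a topic vocabulary, as a stand-in for the statistical technologies mentioned above. The names relevance, focused_crawl and toy_web, and the threshold value, are all hypothetical.

```python
import heapq


def relevance(text, topic_terms):
    """Fraction of the words in text that belong to the topic vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in topic_terms)
    return hits / len(words)


def focused_crawl(web, seed, topic_terms, threshold=0.2, limit=10):
    """Crawl best-first, indexing and expanding only on-topic pages.

    web maps URL -> (page text, list of outgoing links).
    """
    frontier = [(-1.0, seed)]  # negated score so heapq pops the best first
    visited = set()
    indexed = []
    while frontier and len(indexed) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # off-topic: neither index the page nor follow its links
        indexed.append(url)
        for link in links:
            if link not in visited:
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return indexed


# Hypothetical four-page web used only for illustration.
toy_web = {
    "seed": ("search engine crawling and indexing", ["sports", "crawler"]),
    "crawler": ("a crawler feeds the search index", ["deep"]),
    "sports": ("football scores and match reports", ["deep"]),
    "deep": ("indexing the deep web", []),
}
topic = {"search", "crawling", "crawler", "indexing", "index"}
result = focused_crawl(toy_web, "seed", topic)
```

The design choice that matters here is the frontier ordering: links found on highly relevant pages are visited first, and off-topic pages are dropped without being expanded. The quality of the whole crawl therefore rests on the relevance function, which is exactly where the concern about statistical technologies applies.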

Another initiative being talked about is the effort to index the so-called "deep web" or "invisible web". The idea has never really taken off, because it requires going behind public databases, and there is often a question of who holds the rights to the information. For search engines, making these databases and the information extracted from them available could infringe the copyright in that information.

Of course, one should not overlook the efforts being made to improve the standardization and interoperability of XML, RSS and Atom feeds. These efforts could revolutionize, as well as economize, the way information is crawled and indexed. XML and RSS feeds are already making huge inroads into news and blog crawling and aggregation.
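
As a rough illustration of why feeds make crawling cheaper, the sketch below reads the items of an RSS 2.0 feed with Python's standard xml.etree.ElementTree module. The feed content and the function name items_from_rss are made up for the example, and a real aggregator would also need to fetch the feed and handle Atom and malformed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a feed fetched from a site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Summary of the second post.</description>
    </item>
  </channel>
</rss>"""


def items_from_rss(xml_text):
    """Return one dict per <item> with its title, link and summary."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return entries


entries = items_from_rss(SAMPLE_RSS)
for entry in entries:
    print(entry["title"], entry["link"])
```

Because the feed already lists each new item with its title, link and summary, an indexer does not have to discover pages by following links or strip navigation and layout from HTML before indexing them.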

In the near future, we could see smaller players tackle crawling from a completely different angle, one that achieves the requisite size and quality in a way different from what we see with the big search engine players.

 
Copyright © 2003-2004 K-Praxis. All rights reserved.