One of the most striking papers I have read recently is "Entropy of search logs: how hard is search?", by Mei and Church (published at the WSDM'08 conference). It discusses the fundamental questions of how big a search index should be and how hard it is to predict user queries.
Although this paper addresses some very interesting questions, the one that struck a chord with me in particular was the discussion of the nature of the corpora used for retrieval. It is interesting that although research in IR is very active, this topic, which seems (to me) to be a central issue, is usually set aside. The basic assumptions about a corpus are: (a) it has to be really big, and (b) the bigger the better. Although Google makes a very good case for these two claims in the web search domain, the question remains open in other, more topic-focused domains, and assumption (b) would be quite interesting to quantify: namely, how much data do I really need in order to answer specific (types of) questions? And what is the utility (vs. the cost) of each additional piece of information I add to the collection?
I believe the discussion of optimal corpus size will become more and more important as two factors come into play:
- People will look for finer-grained alternatives to web search (e.g., I know that most of the answers to questions about, say, the Python programming language can be obtained from a handful of trusted sites, so why should I search the whole web for them?).
- Full-text search engines will become a software commodity, just like databases are. If I have the resources to build my own search engine, how can I go about doing it with minimum resource investment? Are 1,000,000 pages enough? What about 100,000,000,000? Or is the number somewhere in between (closer to the lower end, according to Mei and Church)?
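
To make the "how many pages are enough?" question a bit more concrete, here is a minimal Python sketch of how one might measure it on a click log. It is my own illustration, not the method from the paper: the function names `entropy_bits` and `coverage_curve` and the toy `clicks` list are all made up, and it assumes the log is simply a list of clicked URL identifiers. The first function computes the empirical Shannon entropy of the log; the second shows what fraction of all clicks would be served by an index holding only the k most-clicked pages.

```python
import math
from collections import Counter


def entropy_bits(items):
    """Empirical Shannon entropy (in bits) of a sequence of hashable items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def coverage_curve(clicked_urls):
    """Fraction of all clicks covered by the k most-clicked URLs, for k = 1..n."""
    counts = sorted(Counter(clicked_urls).values(), reverse=True)
    total = sum(counts)
    covered = 0
    curve = []
    for c in counts:
        covered += c
        curve.append(covered / total)
    return curve


# Toy, invented click log (hypothetical data, for illustration only).
clicks = ["a", "a", "a", "a", "b", "b", "c", "d"]

print(entropy_bits(clicks))    # entropy of the click distribution, in bits
print(coverage_curve(clicks))  # coverage after indexing the top 1, 2, 3, ... pages
```

If the coverage curve flattens early on a real log, that would suggest that each additional page adds little utility for that domain, which is exactly the utility-versus-cost trade-off I would like to see quantified.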