Sunday, July 13, 2008

Seeqpod, and the way it is changing my music habits.

Sometimes an idea comes along, and you know it is good because..., well, because it is just so simple that you wonder how come nobody has done it before.

So, here is Seeqpod. Now, on the surface it looks like yet another social network site where you can listen to what your friends are listening to, be a part of a community, create tags, and so on. Well, it is not. First of all, there is nothing explicitly social about it, and the music is at the forefront - a lot of music (and some videos) crawled from different sources on the web. Kind of like P2P, but without the part where you actually download anything, so it is kind of legal (I guess; otherwise they would shut it down, wouldn't they?).

And now comes the cool part. You can just search for and instantly listen to any song you like (and Seeqpod has pretty good coverage), or, if you are like me and you are into listening to full albums you haven't heard in a while, you can create your own playlist - or search for someone else's. So you just type "Portishead - Third" and get to listen to the whole album online, seamlessly constructed from songs streamed from god knows where. This is the "social" part - people sharing their playlists - but it's not really social, because you create the list for yourself, to help you organize your own music (well, not really "your own")! And the list is actually useful to other people as well --- since it is an actual album, and not a bunch of random songs tagged with "progressive" or whatever.

Tuesday, May 27, 2008

Writing my own Trends Discovery tool

For a while now, I have been interested in getting some trends statistics: how do concepts and names behave over time? A good example is Google Trends, which has become so authoritative that people use it to measure popularity.

So it would be great to have a Google Trends API, wouldn't it? Well, it would, but there isn't one. There was a promise back in December 2007 to give us one, but that's as close as it gets. And in any case, it would be much more satisfying to write one myself.

So, to make a proof of concept, and to estimate how hard it is to analyze trends, I wrote a few scripts to:
a) Monitor a dozen popular news RSS feeds on US politics, world news, technology, science, and business.
b) Extract named entities (company names, places, or people) --- this was done using the great named entity recognizer from the Stanford NLP Group.
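To give a feel for the pipeline, here is a minimal sketch in Python. The headlines and the capitalized-words "entity extractor" are purely illustrative stand-ins - my actual scripts feed real RSS items through the Stanford NER, which is far more accurate than this toy regex:

```python
import re
from collections import Counter

def extract_entities(text):
    """Naive stand-in for a real NER: treat runs of capitalized words
    as candidate entities. (The real pipeline uses the Stanford NER.)"""
    return re.findall(r"(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*", text)

def daily_entity_counts(headlines):
    """Count entity mentions in one day's worth of headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(entity.lower() for entity in extract_entities(headline))
    return counts

# Toy "one day" of headlines (made up for illustration):
headlines = [
    "Obama wins North Carolina primary",
    "Clinton vows to stay in race against Obama",
]
print(daily_entity_counts(headlines))
```

Running this once per day and stacking up the resulting counters gives exactly the kind of per-entity time series that the graphs below compare against Google's.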

I have been running my scripts daily since the beginning of May, and the results so far have been quite interesting: trends extracted from my minuscule news sample seem to match quite well the trends discovered by Google News Trends (this is what is shown in the lower part of the graph on Google Trends).

For illustration, below is an example of the trend discovered by my method vs. Google's for "obama vs. clinton".

Or another example of "google vs. microsoft". (Note the surge of news volume on Microsoft around May 4th and 18th reflected in both graphs.)

I think there is a lot more to get from these trends than just head-to-head comparison, but these initial trends seem quite promising.

So, note to self: next time something isn't readily available, perhaps I can build it on my own.

Wednesday, March 26, 2008

How hard is search for you?

Since I am not feeling particularly inclined to go back to my regular activities, I am finally taking the time to write another post.

One of the most striking papers I have read recently is "Entropy of search logs: how hard is search?" by Mei and Church (published at WSDM'08). It discusses the fundamental questions of how big a search index should be and how hard it is to predict user queries.

Although this paper addresses several very interesting questions, the one that struck a chord with me in particular was the discussion of the nature of the corpora used for retrieval. It is interesting that although research in IR is very active, this topic, which seems (to me) to be a central issue, is usually set aside. The basic assumptions about the corpus are: (a) it has to be really big, and (b) the bigger the better. Although Google makes a very good case for both claims in the web search domain, the question remains open in other, more topic-focused domains, and assumption (b) would be quite interesting to quantify: namely, how much data do I really need in order to answer a specific type of question? And what is the utility (vs. the cost) of each additional piece of information I add to the collection?
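For intuition, the entropy in question is just the Shannon entropy of the empirical query distribution - the lower it is, the more predictable the log. A toy calculation (my own illustration with made-up query logs, not the paper's data):

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples.
    A lower value means the next query is easier to predict."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A skewed log (one dominant query) is much easier to predict
# than a log where every query is different:
skewed_log = ["python"] * 8 + ["java", "ruby"]
uniform_log = ["query%d" % i for i in range(10)]

print(entropy(skewed_log))   # well under log2(10)
print(entropy(uniform_log))  # log2(10), the worst case for 10 queries
```

The same quantity, computed over queries restricted to one site or topic, is how the paper gets a handle on how "hard" search is in a given slice of the web.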

I believe the discussion of optimal corpus size will become more and more important as two factors come into play:
  • People will look for finer-grained alternatives to web search (e.g., I know that most of the answers to questions about, say, the Python programming language can be obtained from a handful of trusted sites, so why should I search the whole web for them?)
  • Full-text search engines will become a software commodity, just like databases are. If I have the resources to build my own search engine, how can I go about doing it with minimum resource investment? Are 1,000,000 pages enough? What about 100,000,000,000? Or is the number somewhere in between (closer to the lower end, according to Mei and Church)?

Wednesday, August 1, 2007

Social Search

What if we harnessed the "social network" phenomenon for web search - say, by using del.icio.us tags? Shenghua Bao from Apex Lab provides some ideas on how this could be implemented. One of his papers discusses the use of tags in language model construction for retrieval; another focuses on creating a social counterpart of PageRank based on user tags.

There are probably many other possibilities, like enriching the original query words with the most query-similar tags, or using tag clouds to discover key concepts and emerging "hot" topics, but to the best of my knowledge these are still mostly unexplored.
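As a sketch of the first idea, here is roughly how tag-based query expansion might look. The tagged pages below are made-up toy data standing in for real del.icio.us bookmarks, and the co-occurrence scoring is the simplest thing that could work, not anything from the papers above:

```python
from collections import Counter

# Toy bookmark data: each page is represented only by its set of user tags.
tagged_pages = [
    {"python", "programming", "tutorial"},
    {"python", "scripting"},
    {"java", "programming"},
]

def expand_query(query_terms, pages, k=2):
    """Enrich the query with the k tags that co-occur most often
    with the query terms across tagged pages."""
    cooccurring = Counter()
    for tags in pages:
        if tags & query_terms:  # page shares at least one tag with the query
            cooccurring.update(tags - query_terms)
    return query_terms | {tag for tag, _ in cooccurring.most_common(k)}

print(expand_query({"python"}, tagged_pages))
```

A real system would of course weight tags by how many users applied them, but even this crude version shows how tags could pull in related vocabulary that the raw query lacks.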

Am I wrong?