Tuesday, May 27, 2008

Writing my own Trends Discovery tool

For a while now, I have been interested in getting some trends statistics: how do concepts and names behave over time? A good example is Google Trends, which has become so authoritative that people use it to measure popularity.

So it would be great to have a Google Trends API, wouldn't it? Well, it would, but there isn't one. There was a promise back in December 2007 to give us one, but that's as close as it gets. And in any case, it would be much more satisfying to write one myself.

So, to make a proof of concept, and to estimate how hard it is to analyze trends, I wrote a few scripts to:
a) Monitor a dozen popular news RSS feeds covering US politics, world news, technology, science and business.
b) Extract named entities (company names, places or people), using a great named entity recognizer from the Stanford NLP group.
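
To give a feel for the two steps above, here is a minimal Python sketch. It is an illustration under assumptions, not the actual scripts: the function names, the sample feed and its headlines are all made up, and the regular expression is only a crude stand-in for the Stanford recognizer (it simply grabs runs of capitalized words).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from datetime import date
from email.utils import parsedate_to_datetime

# ASSUMPTION: crude placeholder for a real named entity recognizer.
# It matches runs of capitalized words and knows nothing about entity types.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


def extract_entities(text):
    """Return candidate entity strings found in text (hypothetical helper)."""
    return ENTITY_RE.findall(text)


def count_entities_by_day(rss_xml):
    """Map each publication day to a Counter of entity mentions."""
    counts = defaultdict(Counter)
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        pub_date = item.findtext("pubDate")
        if not pub_date:
            continue  # skip items without a date; they cannot be placed on a timeline
        day = parsedate_to_datetime(pub_date).date()
        counts[day].update(extract_entities(f"{title}. {summary}"))
    return counts


# Made-up feed with invented headlines, for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Obama speaks in Oregon</title>
<pubDate>Tue, 20 May 2008 22:00:00 GMT</pubDate></item>
<item><title>Clinton visits Kentucky as Obama campaigns</title>
<pubDate>Tue, 20 May 2008 23:00:00 GMT</pubDate></item>
<item><title>Microsoft announces a new product</title>
<pubDate>Sun, 18 May 2008 12:00:00 GMT</pubDate></item>
</channel></rss>"""

counts = count_entities_by_day(SAMPLE_FEED)
print(counts[date(2008, 5, 20)]["Obama"])  # 2
```

Running something like this once a day over each feed and saving the counters would give one data point per entity per day, which is all a trend graph needs.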

I have been running my scripts daily since the beginning of May, and the results so far are quite interesting: the trends extracted from my minuscule news sample seem to match the trends discovered by Google News Trends quite well (this is what is shown in the lower part of the trends graph in Google Trends).

For illustration, below is an example of the trends discovered by my method vs. Google's for the "obama vs. clinton" trend.

Or another example of "google vs. microsoft". (Note the surge of news volume on Microsoft around May 4th and 18th reflected in both graphs.)
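
A head-to-head comparison like the two above boils down to lining up daily mention counts for two names. Here is a small Python sketch of that step; the function name and the numbers are hypothetical and invented purely for illustration, they are not my measured results.

```python
from collections import Counter
from datetime import date


def head_to_head(daily_counts, term_a, term_b):
    """Return (day, count_a, count_b) rows in chronological order."""
    return [
        # A Counter returns 0 for a name that was not seen on that day.
        (day, daily_counts[day][term_a], daily_counts[day][term_b])
        for day in sorted(daily_counts)
    ]


# ASSUMPTION: invented counts, only to show the shape of the data.
daily_counts = {
    date(2008, 5, 2): Counter({"Microsoft": 9, "Google": 4}),
    date(2008, 5, 1): Counter({"Microsoft": 3, "Google": 5}),
}

rows = head_to_head(daily_counts, "Google", "Microsoft")
for day, google, microsoft in rows:
    print(day.isoformat(), google, microsoft)
```

Each row is one point on the graph for each of the two names, so plotting the rows gives the same kind of two-line chart.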

I think there is a lot more to get from these trends than just head-to-head comparison, but these initial trends seem quite promising.

So, note to self: next time there isn't something readily available, perhaps I can do it on my own.