This blog is gone elsewhere!

To facilitate the sharing of contents, I’ve decided to move my personal work blog to Tumblr. Thus, The Rice Cooker has now become The Electric Rice Cooker.

Searching our Sina and QQ Weibo archive

Screenshot at 2012-01-19 17:15:45

We had a search engine built a while ago for Sina Weibo archive, and since yesterday, also for the QQ Weibo archive. We use Lucene as the indexer (to do quick full-text searches) and then store all linked information in our standard database. The difference with the real search engines provided on the Sina and QQ Weibo websites is that we don’t currently implement any weighing, and the results are just everything we got, ordered by publication date.

We index at every four hours, so there’s at least a 30 minutes delay, and at most around 4 hrs 30 minutes. There’s paging, too. Because we’re not Google, be sure to understand that queries normally take up to 1 minute to run (more if there’s lots of activity on the server). The search by region / province on the Sina search is also uber-slow.

Cool feature: you can link directly to searches! For instance, if you were interested in racing celebrity Han Han (韩寒) who has been under fire recently, you may use a link such as these:韩寒韩寒

Other cool feature: Google Translate! Write your search query in your language, and behind the scenes, we’ll try to send a query to the Google Translate API. You’ll know whether it worked when you get your results.

Starting on GWT

One of the discoveries made at Google I/O was the Google Web Toolkit (or GWT). I’m currently starting to learn how to use it to build a tool to curate our various data on Chinese media and potentially other projects.

I used to use Yahoo! User Interface (or YUI), which is not bad at all, but just a totally different model of Web user interface development.