Sina Weibo deleted posts archive

Since Sina Weibo has a pretty good API, and since we do download lots of data every day, it just makes good sense to keep an archive of deleted posts.

The strategy is very straightforward and only incurs a negligible extra number of hits against the API:
– Call the statuses/user_timeline function for each user in your list (we have 2,500 in a sub-list).
– Extract the IDs of all 200 posts in the response and save as a text file, one ID per line. They are already ordered chronologically.
– You should have a previous list of IDs. Use diff to compare both files.
– Loop through the output of the diff. Mark all the IDs that appear in the previous version, but not the new one.
– Those IDs are the deleted posts.
– Mark them, and send your alert, etc. (We also hit the API again on statuses/show to double-check if the post was really deleted.)
– Overwrite the old ID list with the new one.
– Repeat whenever you can fetch a new version of the timeline (you might be rate-limited by Sina if you do it too often).
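
The comparison step above can be sketched in a few lines of Python. This is only a minimal sketch, not our production code: it uses a set lookup instead of the diff tool, and the function names and file layout (one ID per line) are made up for illustration. It does not call the Sina API at all, so the statuses/show double-check is left out. That check still matters, because a post that simply scrolled out of the 200-post window will also show up as missing.

```python
def read_ids(path):
    """Read a saved ID list: one post ID per line, blank lines ignored."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def write_ids(path, ids):
    """Overwrite the saved ID list with the newest snapshot."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(ids) + "\n")


def find_missing_ids(previous_ids, current_ids):
    """Return IDs that were in the previous snapshot but not in the new one.

    These are only *candidates* for deletion: they still need to be
    confirmed (for example with a second API call), since old posts also
    drop out of the timeline when the user publishes new ones.
    """
    current = set(current_ids)
    return [post_id for post_id in previous_ids if post_id not in current]


# Hypothetical usage for one user (file name is an assumption):
#   previous = read_ids("ids/12345.txt")
#   current = [...]  # IDs extracted from the fresh timeline response
#   candidates = find_missing_ids(previous, current)
#   write_ids("ids/12345.txt", current)
```

IDs are kept as strings here so that the text files can be compared exactly as they were written.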

Searching our Sina and QQ Weibo archive

Screenshot at 2012-01-19 17:15:45

We built a search engine for the Sina Weibo archive a while ago, and since yesterday there is one for the QQ Weibo archive as well. We use Lucene as the indexer (to do quick full-text searches) and then store all linked information in our standard database. The difference with the real search engines provided on the Sina and QQ Weibo websites is that we don’t currently implement any weighting, and the results are just everything we got, ordered by publication date.

We index every four hours, so there’s a delay of at least 30 minutes and at most around 4 hours 30 minutes. There’s paging, too. Because we’re not Google, be sure to understand that queries normally take up to 1 minute to run (more if there’s lots of activity on the server). The search by region / province on the Sina search is also uber-slow.

Cool feature: you can link directly to searches! For instance, if you were interested in racing celebrity Han Han (韩寒) who has been under fire recently, you may use links such as these: 韩寒, 韩寒

Other cool feature: Google Translate! Write your search query in your language, and behind the scenes, we’ll try to send a query to the Google Translate API. You’ll know whether it worked when you get your results.

Ma Ying-jeou in pictures on WeiboScope

When you check out WeiboScope today, what you will notice are Yao Ming with sleeping delegates at the Shanghai CPPCC and Ma Ying-jeou, Taiwan’s president, re-elected for a second 4-year term on Saturday (there’s also this weird meme of Zhou Qifeng, Peking University’s president, grinning uncontrollably alongside Li Keqiang, China’s Vice-Premier).


But what really caught my eye were all the photos sitting at the top of our Sina Weibo data stack, which probably runs into the thousands once you reach the bottom of the stack. The image wall above was generated using the image search portion of WeiboScope. One photo in particular is making the rounds: Ma in the US with his future wife, Zhou Meiqing.

The year 2012 bug

In case you noticed, the quality of our WeiboScope declined quite a bit towards the end of last week. This was caused by the passage to the first ISO week of 2012 (which started on Monday). As a consequence, only the most popular posts made in 2011 were counted. We didn’t lose anything, and things are back on track.
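
For readers unfamiliar with ISO weeks, here is a small Python sketch of how the numbering behaves around that new year. It is not the code that misbehaved, only an illustration: the helper name `iso_week_key` is made up, and bucketing posts by (ISO year, ISO week) is an assumption about how one might group them.

```python
from datetime import date


def iso_week_key(d):
    """Return (ISO year, ISO week) for a date, e.g. to bucket posts by week."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return (iso_year, iso_week)


# Dates on either side of the 2011/2012 boundary.
for d in (date(2011, 12, 31), date(2012, 1, 1), date(2012, 1, 2)):
    print(d.isoformat(), iso_week_key(d))
```

Sunday, January 1, 2012 still belongs to ISO week 52 of 2011, and ISO week 1 of 2012 only begins on Monday, January 2. Any code that mixes the calendar year with the ISO week number can therefore end up looking at the wrong set of posts for a few days.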

