Pulling data from Sina Weibo API + the Yihuang self-immolation story

A week or so ago, our director at the JMSC, Prof. Ying Chan, wrote a piece on the ever-changing media landscape in China:

Microblogs, which are limited to 140 characters in length, can be sent from mobile phones or computers. Twitter, the original microblog service, has been blocked in China, but major websites have launched their own Twitter clones, and these have become an important alternative channel for information. It is interesting to note as well that 140 characters in Chinese actually makes for much richer content than the same in English.

(More on the China Media Project website)

To push it a little further, we’ve decided to pull data from Sina Weibo, China’s Twitter equivalent. It may be dubbed a Twitter clone, but like a lot of Chinese online services that I’ve learned to use in the past year living here, it now comes with “Chinese characteristics” (other than blatant censorship, mind you) such as blog-like commenting.

One thing seems different from a developer’s point of view: there is no “immortal” consumer token, so I can’t pull data through OAuth the way Twitter requires you to. Instead, the API still relies on basic authentication, so to pull data I supply my username and password in plain text through a command-line tool called curl. Otherwise, the Sina Weibo API is practically the same as Twitter’s, plus or minus a couple of functions. I’m going to post my data-pulling code on our JMSC Github when I find some time to do it.
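For readers who would rather script this than call curl, here is a minimal Python sketch of what basic authentication amounts to: the username and password are joined with a colon, base64-encoded, and sent in an Authorization header. The endpoint URL and credentials below are placeholders I made up for illustration, not the real Sina Weibo API address, and the request is built but never sent.

```python
import base64
import urllib.request


def build_basic_auth_request(url, username, password):
    """Return a urllib Request carrying an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder endpoint and credentials (assumptions, not the real API URL).
request = build_basic_auth_request(
    "https://api.example.com/statuses/user_timeline.json",
    "my_username",
    "my_password",
)
# urllib.request.urlopen(request) would send it; omitted so nothing is fetched.
print(request.get_header("Authorization"))
```

Note that base64 is an encoding, not encryption, which is why the credentials are effectively travelling in plain text unless the connection itself is encrypted.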

(Edit (2010-10-26): As promised, I posted source code for data collection tools on the Sina Weibo network.)

The Sina Weibo API provides much more granular, predictable and decodable location data. For instance, this is my personal info: userinfo-cedricsam.json. My province is “81” and my city is “1”. What does that mean? The answer is in GB 2260 (2007), the Chinese standard for administrative division codes. I’m not sure if 81 is the code for Hong Kong in GB 2260, but in ISO 3166-2, Hong Kong is in fact “CN-91”. Nonetheless, the codes are totally consistent for other locations in Mainland China (Taiwan is “CN-71” in ISO 3166-2). For instance, if you are living in Guangzhou City (广州市), then your data from the Sina Weibo API would come out as province=44 and city=1, with the GB 2260 code for Guangzhou City being 440100…
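The Guangzhou example suggests how the two fields combine: two digits for the province, two for the city, then two trailing zeros. Here is a small Python sketch of that reading. The function name is my own, and the digit layout is an assumption inferred from that single example rather than from the API documentation.

```python
def gb2260_code(province, city=0):
    """Combine province and city numbers into a six-digit GB 2260 style code.

    Assumes the layout seen in the Guangzhou example: province=44, city=1
    gives 440100. Values may arrive as strings, so both are cast to int.
    """
    return f"{int(province):02d}{int(city):02d}00"


print(gb2260_code(44, 1))      # the Guangzhou City example from the text
print(gb2260_code("44", "1"))  # same result when the fields are strings
print(gb2260_code(44))         # province-level code, city left at 0
```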

The Sina Weibo API also lets you download up to the last 9999 followers or friends of any given user. So, using only the province column and the ISO 3166-2 / GB 2260 provincial codes, I was able to plot on a Google Chart (cht=map) the followers of Ms. Zhong Rujiu (钟如九), whose family members self-immolated in Yihuang (宜黄) to protest the forced requisition of their land. (Note that this story has not been censored, as far as I know.)

The result is this map:

Sina Weibo – Zhong Rujiu’s last 9999 followers geolocation
URL: http://jmsc.no-ip.org/social/sinaweibo.py/chart?user_id=1819775930

The map is not very informative, mainly because of the 9999-follower limit. The time at which each friendship or followership began is not indicated either, but the order is implied in the friends/followers data that you receive (I could have stored this ordering, but I chose not to at this point).

By a large margin, the largest group of users lived in Guangdong (~1400), followed by residents of Beijing (~900). Why? Is it just because of Sina’s market penetration in Guangdong?
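Counts like these come from tallying the province field across the downloaded follower records. Below is a minimal Python sketch of that tally, not the script I actually ran. The sample records are made up, the function name is hypothetical, and the Hong Kong override simply encodes the 81 versus “CN-91” mismatch mentioned earlier as an assumption.

```python
from collections import Counter

# Assumption from the post: Hong Kong appears as province 81 in the API
# data but is "CN-91" in ISO 3166-2.
ISO_OVERRIDES = {81: 91}


def count_by_province(followers):
    """Tally follower records by ISO 3166-2 style code, e.g. 'CN-44'."""
    counts = Counter()
    for follower in followers:
        province = int(follower["province"])
        province = ISO_OVERRIDES.get(province, province)
        counts[f"CN-{province:02d}"] += 1
    return counts


# Made-up sample records shaped like the province/city fields described above.
sample = [
    {"province": "44", "city": "1"},
    {"province": "44", "city": "3"},
    {"province": "11", "city": "1"},
    {"province": "81", "city": "1"},
]

counts = count_by_province(sample)
for code, total in counts.most_common():
    print(code, total)
```

The resulting codes are the kind of provincial identifiers a map chart can take as region labels.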

You might notice that the URL points to a Python script. It acts as a proxy and wrapper for the Google Chart API: I create the chart using data pulled from my local database rather than real-time data.

So, from here, we know that we have interesting tools and that the Sina Weibo API is in fact as open and accessible as the Twitter API. Because of basic authentication (for the moment, we guess), the programming is also a lot easier to deal with.

6 responses to “Pulling data from Sina Weibo API + the Yihuang self-immolation story”

  1. John Dennehy says:

    Thanks for posting this. We’re also working with the Weibo API and it is proving to be very straightforward to work with. Hopefully it won’t get over baked in the years to come.

  2. jon says:

    One question about Sina: is it possible to search for a particular keyword and collect the results via the API? Or to ask for the messages posted by a given user?


    • Cedric Sam says:

      Yes you can. Basically all the functions in the Twitter API should be available as well in the Sina Weibo API. For one thing, I know that those two functions you mention are available. One is search and the other is user_timeline.

  3. Drew says:

    Hi, I have one question that I hope you can answer. Do you know how I can access their search-by-keyword API now? It seems like they put some controls on that API in this new version.
