Who are the most popular Sina Weiboers?

As of Oct 29 2010, the answer lies here: http://tables.googlelabs.com/DataSource?dsrcid=292639. With the techniques described in another blog post, I’ve done pulls from the Sina Weibo API by province and city, ordered by follower count. I made sure that the pull for each region (province code alone, or province + city codes) went down to users with as few as 100 followers. For the sake of simplicity, I made a summary of the 100 most followed users on Sina Weibo.

Sina Weibo Top 100 Most Followed Users (including administrative accounts):

Link to Top 100 + Entire data set (20,000+ rows)

Sina, the portal, actually assembled a China Daily feature published on June 17 2010 describing the state of the Weibosphere. In comparing Twitter with Sina Weibo, the article quoted iResearch.cn blogger Liu Xingliang as saying that Sina was, as expected, far less political than its American counterpart. Sina Weibo also had two business people in its Top 10, which Twitter did not have.

The Top 10 changed a little bit, but still saw mainland actress Yao Chen (姚晨) with about 3.3 million followers sitting at the very top, followed by Taiwanese actress Dee Hsu (徐熙娣) and mainland-born Zhao Wei (赵薇).

We suspect that some of these numbers may be somewhat inflated by this celebrity suggestion directory (which has an “add-all” button at the top-right of each block).

Don’t look for Sina Weibo pages from JMSC staff: they are all on QQ’s microblog system. CMP director Qian Gang for instance had close to 1.8 million followers as of yesterday.

Sina Weibo data-grabbing tools for Linux

I’ve written some quick tools for grabbing basic data from the API of Sina Weibo, clearly the most popular of the “Chinese Twitters”. You can get them here:

JMSCHKU Social on GitHub.com

Using these same tools, I managed to produce this mini-survey of Zhong Rujiu’s followers across China in less than an hour.

In this version of the tools, you can get the latest statuses (limited to 200 by Sina), user info, and friends and followers (both limited to the last 9999, namely #0-5000 and 4999-9999).

For those who are not familiar with Sina Weibo, it’s quickly evolving past the stage of simply being a Twitter clone, providing novel interface innovations such as the ability to make blog-style comments. Another cool thing about Sina Weibo? Commitment to be open.

The API is very similar to Twitter’s, aside from a few Sina idiosyncrasies. Also, Sina Weibo still provides basic authentication on top of OAuth. In plain language, it means that you can pull data with just a username and password, rather than with tokens that need to be generated first. That’s why I can afford to have sinagetter.sh, a Bash shell script.
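To make the basic authentication point concrete, here is a minimal Python sketch of what a client has to do: encode the username and password into one header and attach it to the request. The host `api.example.com` is a placeholder and the path is borrowed from Twitter's API as an assumption; neither is taken from the Sina Weibo documentation, and this is not the code in sinagetter.sh.

```python
import base64
import urllib.request

# Placeholder URL (an assumption for illustration, not the real Sina endpoint).
EXAMPLE_URL = "http://api.example.com/statuses/user_timeline.json"


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic "Authorization" header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(url, username, password):
    """Build (but do not send) a GET request carrying basic credentials."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


req = build_request(EXAMPLE_URL, "user", "pass")
# urllib.request.urlopen(req) would actually send the request.
```

No token exchange is needed, which is what makes a plain shell script with curl workable. The credentials are only encoded, not encrypted, so they travel in the clear over plain HTTP.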

The repository also has my Twitter-related code, but most of it hasn’t been updated for a while. I just made a new version of twitter.users.py, a script that grabs user info by unique Twitter ID.

Twitterball and Twitterbubble to visualise Twitter data from China using processing.js

I created two visualisations of our Twitter data from China, Twitterball and Twitterbubble. Both are made using processing.js.

Pulling data from Sina Weibo API + the Yihuang self-immolation story

A week or so ago, our director at the JMSC, Prof. Ying Chan, wrote a piece on the ever-changing media landscape in China:

Microblogs, which are limited to 140 characters in length, can be sent from mobile phones or computers. Twitter, the original microblog service, has been blocked in China, but major websites have launched their own Twitter clones, and these have become an important alternative channel for information. It is interesting to note as well that 140 characters in Chinese actually makes for much richer content than the same in English.

(More on the China Media Project website)

To push it a little further, we’ve decided to pull data from Sina Weibo, China’s Twitter equivalent. It may be dubbed a Twitter Clone, but like a lot of online Chinese services that I’ve learned to use in the past year living here, it now comes with “Chinese characteristics” (other than blatant censorship, mind you) such as blog-like commenting.

One thing that seems different from a developer’s point of view: there is no “immortal” consumer token (so I can’t pull data through OAuth, as Twitter requires you to), but there is still a reliance on basic authentication (so to pull data, I supply the username and password in plain text through a command-line tool called curl). The Sina Weibo API is practically the same as Twitter’s, plus or minus a couple of functions. I’m going to post my code to pull data on our JMSC Github when I find some time to do it.

(Edit (2010-10-26): As promised, I posted source code for data collection tools on the Sina Weibo network.)

The Sina Weibo API provides much more granular, predictable and decodable location data. For instance, this is my personal info: userinfo-cedricsam.json. My province is “81” and my city is “1”. What does that mean? The answer is in GB 2260 (2007), the Chinese standard for administrative locations. I’m not sure if 81 is the code for Hong Kong in GB 2260, but in ISO 3166-2, Hong Kong is in fact “CN-91”. Nonetheless, the coding is totally consistent for other locations in Mainland China (Taiwan is “CN-71” in ISO 3166-2). For instance, if you are living in Guangzhou City (广州市), then your data from the Sina Weibo API would come out as province=44 and city=1, with the GB 2260 code for Guangzhou City being 440100…
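The Guangzhou example suggests a simple rule for going from the two API fields to a six-digit code: two digits of province, two digits of city, then "00". Here is a small Python sketch of that rule. The function name is my own, and the rule is only inferred from the one example above, so treat it as an assumption rather than a full GB 2260 implementation (it ignores county-level codes entirely).

```python
def gb2260_code(province, city=0):
    """Combine province and city codes into a six-digit GB 2260-style code.

    Assumes the layout PPCC00 inferred from the Guangzhou example
    (province=44, city=1 -> 440100). County-level digits are left as 00.
    """
    province, city = int(province), int(city)
    if not (0 <= province <= 99 and 0 <= city <= 99):
        raise ValueError("province and city must each fit in two digits")
    return f"{province:02d}{city:02d}00"


guangzhou = gb2260_code(44, 1)      # "440100"
guangdong = gb2260_code("44")       # province only: "440000"
```

The function accepts strings as well as integers, since JSON from an API may deliver the codes either way.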

The Sina Weibo API also lets you download up to the last 9999 followers or friends of any given user. So, using only the province column and ISO 3166-2 / GB 2260 provincial codes, I was able to geocode on a Google Chart (cht=map) all of the followers of Ms. Zhong Rujiu (钟如九), one of the persons who self-immolated in Yihuang (宜黄) to protest the forced requisition of their land. (It should be noted that this story has not been censored, as far as I know.)

The result is this map:

Sina Weibo – Zhong Rujiu’s last 9999 followers geolocation
URL: http://jmsc.no-ip.org/social/sinaweibo.py/chart?user_id=1819775930

It is not very interesting, notably because of the 9999 limitation. The time of friendship/followership is not indicated either, but the order of it is implied in the friends/following data that you receive (I could have stored this data, but I chose not to at this point).

By a large margin, most users lived in Guangdong (~1400). Second came residents of Beijing (~900). Why? Is it just because of market penetration by Sina in Guangdong?

You might notice that it is a Python script. It in fact acts as a proxy and wrapper for Google Chart API as I create the chart using data pulled on my local database rather than real-time data.
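A rough Python sketch of what such a wrapper has to do: tally stored followers by province code, then turn the tally into query parameters for a map chart. This is not the actual sinaweibo.py. The record layout (a dict with a "province" key) is hypothetical, and apart from cht=map, the parameter names (chs, chld, chd) are my assumptions about the Google Chart API and should be checked against its documentation.

```python
from collections import Counter
from urllib.parse import urlencode


def count_by_province(followers):
    """Count followers per province code, skipping records without one."""
    return Counter(
        f["province"] for f in followers if f.get("province") is not None
    )


def map_chart_params(counts):
    """Turn a province tally into (assumed) Google Chart map parameters."""
    codes = sorted(counts)
    return {
        "cht": "map",                                   # from the post
        "chs": "600x400",                               # assumed size parameter
        "chld": "|".join(f"CN-{int(c):02d}" for c in codes),  # assumed region list
        "chd": "t:" + ",".join(str(counts[c]) for c in codes),  # assumed data series
    }


# Hypothetical sample records standing in for rows from the local database.
sample = [{"province": "44"}, {"province": "11"}, {"province": "44"}, {}]
counts = count_by_province(sample)
params = map_chart_params(counts)
query = urlencode(params)  # would be appended to the chart service URL
```

Building the chart from a stored tally rather than live API calls is what keeps the page fast: the slow pull happens once, and the chart request only reads counts.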

So, from here, we know that we have interesting tools and that the Sina Weibo API is in fact as open and accessible as the Twitter API. Because of basic authentication (for the moment, we guess), the programming is also a lot easier to deal with.

A new tweet-gathering tool + sample with #liuxiaobo and 刘晓波 as search words

We’ve developed a simple tool that uses the Twitter API to collect tweets into our local database at the JMSC. We initially selected a list of 1000 China-based tweeters, removed those with private profiles (which are impossible to gather), and proceeded to download their last 3200 tweets (the maximum allowed by the API), starting a few weeks ago. For some, it meant tweets as old as 2008, and in most cases at least well into 2009. And since this week, I’ve started to continuously collect the tweets four times a day.
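The backfill step can be sketched in Python as a loop that walks backwards through a user's timeline, one page at a time, until it hits the 3200 cap or runs out of tweets. This is an illustration, not our tool: `fetch_page` is a hypothetical callable standing in for the real API call, and the page size of 200 and the "ask for IDs below the oldest one seen" paging are assumptions about how the timeline is requested.

```python
def collect_timeline(fetch_page, cap=3200, page_size=200):
    """Collect up to `cap` tweets, newest first.

    fetch_page(max_id, count) is a hypothetical callable that returns a
    list of tweet dicts (each with an integer "id"), newest first, with
    ids no greater than max_id (or the newest tweets if max_id is None).
    """
    tweets = []
    max_id = None
    while len(tweets) < cap:
        page = fetch_page(max_id, min(page_size, cap - len(tweets)))
        if not page:
            break  # the user has no older tweets
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # next page: strictly older tweets
    return tweets


def fake_fetch(max_id, count):
    """Stand-in for the API: a user with 500 tweets, ids 500 down to 1."""
    start = 500 if max_id is None else min(max_id, 500)
    return [{"id": i} for i in range(start, max(start - count, 0), -1)]


everything = collect_timeline(fake_fetch)
capped = collect_timeline(fake_fetch, cap=250)
```

Injecting the fetch function keeps the paging logic testable without touching the network, and the same loop can serve the four-times-a-day updates if it stops at the newest ID already stored.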

The tool has a web interface, but it is not yet ready to be released to the public because of load issues and other unresolved questions. Twitter is notoriously bad at providing meaningful search, because of the large volume of users. But since we are keeping track of only about 1000 tweeters, most of them from a list of “influential” tweeters, we hope that we can make more sense of this particular (and authoritative) slice of the Chinese Twittersphere. We hope to eventually have data on a more “popular” Hong Kong Twittersphere, the focus of our research (the problem is how to select a sample of the public, which makes the latter more difficult).

For now, I can provide CSV files of individual search queries if you send an e-mail to me at cedsam@hku.hk. The files may look like the following CSV files with #liuxiaobo and 刘晓波 as search words.
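For a sense of how such a file can be produced, here is a minimal Python sketch that filters stored tweets by a list of search words and writes the matches as CSV. The column names and the sample rows are hypothetical; the files we send out may be laid out differently.

```python
import csv
import io


def search_tweets(tweets, terms):
    """Return tweets whose text contains any of the terms (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    return [
        tw for tw in tweets
        if any(term in tw["text"].lower() for term in lowered)
    ]


def to_csv(tweets, fields=("id", "screen_name", "created_at", "text")):
    """Serialise tweets to a CSV string with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(tweets)
    return buf.getvalue()


# Hypothetical rows standing in for the local database.
stored = [
    {"id": 1, "screen_name": "a", "created_at": "2010-10-08", "text": "News on #LiuXiaobo"},
    {"id": 2, "screen_name": "b", "created_at": "2010-10-08", "text": "刘晓波获奖"},
    {"id": 3, "screen_name": "c", "created_at": "2010-10-09", "text": "Lunch time"},
]
matches = search_tweets(stored, ["#liuxiaobo", "刘晓波"])
output = to_csv(matches)
```

Searching for both the hashtag and the Chinese name in one pass matters here, since many tweets use only one of the two.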