Simple demonstration: talking and searching about “Egypt” in China on Sina Weibo

“In accordance with law and regulations, the search results cannot be displayed”

Although that doesn’t prevent you from writing and reposting about Egypt if you want

Banning the chatter on the characters 埃及 (“Egypt”) and banning searches on them are obviously two very different things, as the two screenshots above demonstrate.

(Edit 2011-01-31 10AM HKT: Thanks to Jonathan for reminding me to write down the translation, and to add links to the bigger images. The first image says that, by law and regulations, the results cannot be displayed. In the second image, I test a post in which I write the characters for Egypt. The next morning, it was still there. I also reposted a post from Sina Weibo’s most prominent tech-related microblogger, former Google China chief Kai-fu Lee.)

The former is very difficult to do without impairing the user experience (“what about my leisure trip to Egypt last month”), while the latter is probably just a case of blacklisting characters in the search function. I’ll pull some hard numbers to try to show the effect of the search ban on online chatter. My guess is none, because who ever uses the search function of a microblog system, seriously?

Starting to graph the social networks

Pattern of reposting of a post on the Qian Yunhui incident

My boss, Dr. King-wa Fu, wrote code in the R programming language using graph analysis packages and a force-directed layout (see Fruchterman-Reingold) to generate the graph seen here. This graph retraces the reposting of a particular post on the Qian Yunhui incident/murder in late December 2010 (see one of the reposts; the original was deleted; login required).

The red is for female reposters, and the blue is for male ones. We traced the reposting pattern using the references to previous users contained in the reposting text (in Sina, the original post is preserved in its entirety and cannot be modified). While not terribly interesting in terms of colouring (we kept the more interesting ones for more official publications), the graph does help visualise how posts are shared on a social network like Sina Weibo.
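As a rough illustration of that tracing step, here is a small Python sketch that rebuilds repost edges from the text of a single repost. It assumes earlier reposters are quoted in the text as "//@name:" markers, nearest first; the function name, the sample text and the user names are made up, and this is not the R code that produced the graph.

```python
import re

# Assumption: each earlier reposter is quoted in the repost text as "//@name:"
# (ASCII or full-width colon).
MENTION = re.compile(r"//@([^:：\s]+)[:：]")

def repost_edges(reposter, text, original_author):
    """Return (source, target) pairs from the original author down to the reposter."""
    # Names appear nearest-first in the text, so reverse them to get chronological order.
    chain = [original_author] + MENTION.findall(text)[::-1] + [reposter]
    return list(zip(chain, chain[1:]))

edges = repost_edges(
    "carol",
    "so sad //@bob: unbelievable //@alice: please share",
    "origin",
)
print(edges)
```

Collecting these pairs over every repost of one post gives an edge list that a graph package can lay out.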

(We dream that one day such graphs can be generated in real time and that you could zoom in and out. I’ve seen rather similar products in a previous life, when I was working in bioinformatics and playing with 3-D modelling tools and complex molecular model databases.)

On a different topic, with data not published here, we clearly saw that celebrities (people with many followers) play a significant role in driving content sharing. Many users read those celebrities’ weibo and in turn repost to their own networks.

Hopefully, we’ll have more of these to share in the next few weeks…

The (supposed) gender imbalance on the Chinese Weibosphere (updated)

Gender divide on Sina Weibo in Hong Kong

In November, our Hong Kong sample had a larger majority of women than the mainland sample. The current data shows 72,753 women for 27,871 men (about 72% women), which is, give or take one percent, the same proportion. I was starting to suspect that this imbalance might be linked to our sampling method, which is based on a sample of posts from which we grab the users: if women are more frequent updaters, they should appear more frequently, no matter their real share of the Weibo population. However, some data that I found may indicate that it is not an accident.
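To get a feel for how strong that sampling effect could be, here is a Python sketch with entirely made-up numbers: a population split 50/50, in which women post twice as often as men. It assumes each sampled post is drawn independently, with probability proportional to its author's posting rate, and computes the expected share of women among the distinct users captured.

```python
def share_of_women_in_sample(n_women, n_men, rate_women, rate_men, n_posts):
    """Expected share of women among distinct users seen in n_posts sampled posts."""
    total_rate = n_women * rate_women + n_men * rate_men
    # Chance that a given user authors at least one of the sampled posts.
    seen_w = 1 - (1 - rate_women / total_rate) ** n_posts
    seen_m = 1 - (1 - rate_men / total_rate) ** n_posts
    women = n_women * seen_w
    men = n_men * seen_m
    return women / (women + men)

# Hypothetical population: equal numbers, women posting twice as often.
small = share_of_women_in_sample(50_000, 50_000, 2.0, 1.0, 1_000)
large = share_of_women_in_sample(50_000, 50_000, 2.0, 1.0, 10_000_000)
print(round(small, 3), round(large, 3))
```

With few sampled posts the share of women comes out close to two thirds, and it drifts back towards one half once the sample is large enough to catch nearly everyone. So under these assumptions the bias depends on how far the crawl has saturated the active population.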

An analysis method that we developed in the past week seems to support this gender imbalance, to my great surprise. I now have a bunch of new scripts that can gather all the reposts of a single post (in Weibo, all reposts/retweets are posts of their own, with a reference to the original) and gather basic information on users. With previous methods, we only had a cross-section of the user base, but this time we have a clearly defined and complete sample of users. We compiled the data for a popular non-entertainment post on the Choi Yuen Chuen (菜園村) issue (login required) in Hong Kong, by a Hong Kong activist, and found that it was reposted by 1050 women and 293 men (in Weibo, the gender profile attribute is mandatory and exposed by the API). And 1050 / (1050 + 293) is… 0.782, which means that 78.2% of the reposters are women.

In the entire sample, Hong Kong and other territories combined, the numbers are 6,211 women and 1,814 men, which is 77.4% [strangely enough for a Hong Kong issue, the majority of reposters were mainland-based, with the most people from Guangzhou (1512) and Shenzhen (411)]. To give more power to these observations on gender, I’d have to look at a larger variety of posts, and actually verify the validity of the gender identification. It would be interesting to see whether these same users who repost are actually people who post a lot of original things, and if they are followed by a comparable number of users. Do women make the user base of Sina Weibo? Does the repost pattern reflect their style of contributions?
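For anyone who wants to redo the arithmetic, the proportions quoted in this post can be checked with a few lines of Python (the helper name is mine):

```python
def share_of_women(women, men):
    """Fraction of women in a group of women + men."""
    return women / (women + men)

hk_sample = share_of_women(72753, 27871)     # Hong Kong user sample
hk_reposters = share_of_women(1050, 293)     # Hong Kong reposters of the post
all_reposters = share_of_women(6211, 1814)   # reposters from all territories

print(f"{hk_sample:.1%} {hk_reposters:.1%} {all_reposters:.1%}")
```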

Top 500 posts in Hong Kong on Weibo from Nov 1 to Jan 24

Using our sample of Hong Kong users, we compiled a list of the 500 overall most popular posts. You’ll notice an overwhelming proportion of entertainment-related posts. Not shown, but thoroughly observed, is that a large majority of reposters (90-95%) of the most popular posts are women (or rather, we presume, girls). The gender imbalance is at about 72-75% women normally in the Hong Kong sample, and I’ll talk about this in another post this afternoon or tomorrow.

Is it me or is everyone else sick too? Demonstration with Sina Weibo data

View the chart and data tables (may not appear if syndicated on Facebook)

I have a slight sore throat, but nothing to worry about. But my colleagues have been coughing and wheezing most of this week, and I was coincidentally talking about sentiment analysis with my boss yesterday… What makes it easy on Sina Weibo is that emoticons have been codified, such that a happy face is now [呵呵] (“hehe”) and the crying face is a [淚] (crying) [see the Weibo API file]. I ran a database query that counted the posts containing the sick face [生病] (being ill) emoticon and got the chart above.
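The query is along these lines. This is only a sketch using Python's built-in sqlite3 with an in-memory table; the table name, column names and sample rows are invented for the example and do not reflect our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statuses (user_id INTEGER, created_at TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO statuses VALUES (?, ?, ?)",
    [
        (1, "2011-01-05 09:00:00", "头痛 [生病]"),
        (1, "2011-01-05 21:30:00", "还是不舒服 [生病]"),
        (2, "2011-01-05 12:00:00", "今天天气不错 [呵呵]"),
        (3, "2011-01-06 08:15:00", "[生病] 请假一天"),
    ],
)

# Per day: posts containing the sick emoticon, and distinct users who posted it.
rows = conn.execute(
    """
    SELECT date(created_at) AS day,
           COUNT(*) AS sick_posts,
           COUNT(DISTINCT user_id) AS sick_users
    FROM statuses
    WHERE text LIKE '%[生病]%'
    GROUP BY day
    ORDER BY day
    """
).fetchall()
print(rows)
```

Because the emoticon is a fixed bracketed string, a plain substring match is enough; no real sentiment analysis is involved.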

I divided the total number of posts and of distinct users by 100 for the sake of visibility. I didn’t plot the proportion of users who used the sick emoticon versus the total for the sake of clarity, but the raw data is available here.

The blue/red line tracks the posts containing the sick emoticon; in the week following the new year it rises steadily relative to the total number of posts. There’s also a strange peak on December 16th, which we think may be people feeling sick about studying for final exams. 😛

More users in the DB, but still about the same number of active users: Sina Weibo in Hong Kong

More users in the DB, but still about the same number of active users

Our crawling of Sina Weibo users is going well, but the number of active users stays roughly the same. To explain how we crawl: to get a list of fresh users who are active (a very important criterion), we use the statuses/search function of the Sina Weibo API to fetch the latest 200 posts as of the present time (200 is the maximum for Sina Weibo; it is much larger for Twitter, namely 3,200 posts, or 16 pages of 200 posts). We specify province=81 (Hong Kong) and store any user found in the search (which also picks up some non-Hong Kongers, in the case of reposts). We are now about 2,000 short of the magic mark of 100,000 users followed.
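One round of that crawl can be sketched in Python as follows. No network call is made here: the "source" parameter name for the APPKEY and the "user" and "id" fields in each result are assumptions for illustration, and the sample page is invented.

```python
def search_params(app_key, province=81, count=200):
    """Query parameters for one statuses/search call (province 81 = Hong Kong)."""
    # "source" as the name of the APPKEY parameter is an assumption.
    return {"source": app_key, "province": province, "count": count}

def collect_users(statuses, known_users):
    """Store the author of every post in a page of results; return how many were new."""
    before = len(known_users)
    for status in statuses:
        user = status["user"]           # assumed field name
        known_users[user["id"]] = user  # keyed by id, so repeat authors are stored once
    return len(known_users) - before

# Invented page of results standing in for the API response.
page = [
    {"id": 101, "text": "...", "user": {"id": 1, "gender": "f"}},
    {"id": 102, "text": "...", "user": {"id": 2, "gender": "m"}},
    {"id": 103, "text": "...", "user": {"id": 1, "gender": "f"}},
]
known = {}
new_first = collect_users(page, known)
new_second = collect_users(page, known)
print(search_params("YOUR_APPKEY"), new_first, new_second)
```

Running the same page twice adds nothing the second time, which is what lets the script be called repeatedly on the latest 200 posts.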

Another script, run as a daily cronjob, updates a list of Sina Weibo Hong Kong users to “check”, which consists of downloading their latest posts using statuses/user_timeline. It takes about 36-48 hours to go through the entire list of 100,000 users, depending on which other jobs are querying the Sina Weibo API at the same time for other purposes. We figured we could classify users and make the updating more selective, although that remains to be coded.
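To put those 36-48 hours in perspective, a quick back-of-the-envelope calculation in Python (the helper names are mine):

```python
def users_per_hour(total_users, hours):
    """Average number of users checked per hour."""
    return total_users / hours

def seconds_per_user(total_users, hours):
    """Average time spent on each user, in seconds."""
    return hours * 3600 / total_users

fast = (users_per_hour(100_000, 36), seconds_per_user(100_000, 36))
slow = (users_per_hour(100_000, 48), seconds_per_user(100_000, 48))
print(fast, slow)
```

That works out to somewhere between roughly 2,100 and 2,800 users per hour, or between about 1.3 and 1.7 seconds per user.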

One interesting finding is that while the number of users in our database increases, the number of active users does not. If at the first retrieval of posts for a new user we get the latest 200, that means that for certain users the accessible timeline may stretch back quite a bit, while for others it may not. This hasn’t been characterized yet, so I can’t tell for sure. Also, the crawling of users is based on a sample from the API, presumably “random” with respect to how users update their microblogs.

I need to gather more information, and what I find “bizarre” may not be bizarre at all if, say, getting the latest 200 posts evens out over the two and a half months covered in this graph. But if our sampling picked up a lot of new users without older posts, then maybe they are spam accounts, or just very infrequent posters who mainly read.

Need to find something on Sina Weibo? (Using the Qian Yunhui 钱云会 case as an example)

We’re writing something for the Chinese Internet Research Conference and were looking for the first post on Sina Weibo to refer to a particular recent and gruesome incident, the killing of Qian Yunhui (钱云会). Qian, 53, was the elected head of Zhaiqiao village outside Yueqing City, near the major port of Wenzhou in Zhejiang, one of China’s rich coastal provinces. His horrible death, crushed under a construction truck, made the rounds on the Chinese Internet, but went a little under the radar (for me, at least) because it occurred on Christmas Day. The WSJ talked about it, and it’s reported freely enough in various online domestic media, including the Southern Weekend, Xinhua (video) and Global Times.

Now, to do what we are interested in, namely finding the first post on Sina Weibo and seeing how it went viral from there, we could not use Sina Weibo’s web-based search, because you can’t search by date and old results are naturally pushed to the very bottom (as on Twitter search). Nor could we use Google, because Sina Weibo requires that you log in with your account to see anything beyond the first page. Basically, we had to use the Sina Weibo API to accomplish what we wanted.

The API has a statuses search function, which lets users (generally through an app that plugs into Sina Weibo) search the microblogging network based on criteria such as province and city, with a start and end time (converted to UNIX time).

You must first register your Sina Weibo account on the developer network, and then create a dummy app. This will get you your APPKEY. My end time is set to December 28th at 00:00:00, and I put the count to its maximum of 200 search results. The search URL then looks like this: 钱云会&endtime=1293408000&count=200
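For reference, here is a Python sketch of how the endtime value and the query string can be produced. Only the query string is built: the API base URL and the APPKEY are left out, and "q" as the name of the search-term parameter is an assumption. Note that time zones matter: the value 1293408000 in the URL above is midnight UTC on December 27th, 2010 (08:00 Hong Kong time), so it is worth stating the time zone explicitly when converting.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

HKT = timezone(timedelta(hours=8))  # Hong Kong time, UTC+8

def unix_time(year, month, day, tz=timezone.utc):
    """Seconds since the UNIX epoch for midnight on the given date in time zone tz."""
    return int(datetime(year, month, day, tzinfo=tz).timestamp())

def search_query(term, endtime, count=200):
    """Build the query string only; "q" is an assumed parameter name."""
    return urlencode({"q": term, "endtime": endtime, "count": count})

utc_midnight = unix_time(2010, 12, 27)
hkt_midnight = unix_time(2010, 12, 28, tz=HKT)
query = search_query("钱云会", utc_midnight)
print(utc_midnight, hkt_midnight, query)
```

urlencode also takes care of percent-encoding the Chinese characters as UTF-8, which is easy to get wrong by hand.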

I won’t reveal our APPKEY, so this URL will not actually work as shown. But it basically will give you this file (JSON format). If you are a programmer like me, you may try to read the file by eye. Otherwise, an online JSON parser might be very helpful.

From there, I now have the earliest posts containing the term “钱云会” (Qian Yunhui), on the assumption that nothing was deleted or censored by hand. It’s not an issue affecting the central government, so my guess is that it’s not worth the censors’ time to delete posts (though the images are really shocking). I will store the statuses in our database using the scripts posted on our JMSC HKU social media project GitHub, and then perform various analyses, like tracing the reposts.

Incidentally, after playing with the UNIX endtime in the search URL, we find that the earliest post with the term was at 11:37:00 on December 26th, which is about a day after the incident occurred (see JSON file).
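The "playing with the endtime" part can be automated. Below is a Python sketch of the idea: keep moving endtime back to just before the oldest post returned, until a search comes back empty. The search function is a stand-in for the real API call, posts are assumed to carry created_at as a UNIX timestamp, and endtime is assumed to be inclusive; all three are assumptions for the sake of the example.

```python
def earliest_post(search, term, endtime, count=200):
    """Step endtime backwards until a search returns nothing; return the oldest post seen."""
    earliest = None
    while True:
        page = search(term, endtime, count)
        if not page:
            return earliest
        earliest = min(page, key=lambda post: post["created_at"])
        # Next round: only posts strictly older than the oldest one found so far.
        endtime = earliest["created_at"] - 1

# Stand-in for the real API call: newest posts at or before endtime, at most `count`.
FAKE_POSTS = [{"id": i, "created_at": 1000 + 10 * i} for i in range(7)]

def fake_search(term, endtime, count):
    hits = [p for p in FAKE_POSTS if p["created_at"] <= endtime]
    return sorted(hits, key=lambda p: p["created_at"], reverse=True)[:count]

first = earliest_post(fake_search, "钱云会", endtime=2000, count=3)
print(first)
```

One caveat with this approach: if more than `count` posts share the same second as the oldest one on a page, some of them would be skipped.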

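This post was not reposted at all and came from an ordinary girl. She wrote that she had heard that Qian Yunhui was held down by some men to be crushed under the truck (听说浙江省乐清市蒲歧镇寨桥有个人叫钱云会(前几年告国家干部贪污没成功被关起来,但是他誓不罢休要继续告发他们),就在寨桥路口,几个人把他活生生地按在地上让浙能电厂的工车压死了。). In English, roughly: “I heard that in Zhaiqiao, Puqi Town, Yueqing City, Zhejiang Province, there was a man named Qian Yunhui (a few years ago he was locked up after unsuccessfully accusing state officials of corruption, but he vowed never to give up and to keep reporting them). Right at the Zhaiqiao intersection, several people pinned him to the ground alive and had a work truck from the Zheneng power plant crush him to death.”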

This is it for now. We’ll keep you posted as things move ahead.

Somewhere between 3.4 and 3.8

Sina Weibo Nov-Dec 2010

From the data collected, it seems that Sina Weibo users write between 3.4 and 3.8 status updates per day… But I haven’t looked into the median or spread, or taken inactive users into account… The increasing number of entries is also certainly due in part to the increasing number of users we follow.
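To show why the median and inactive users matter here, a small Python sketch with made-up post counts: one heavy poster is enough to pull the average far above what a typical user writes.

```python
from statistics import mean, median

def posts_per_user_per_day(counts, days):
    """counts maps user id -> number of posts collected over `days` days."""
    return {user: n / days for user, n in counts.items()}

# Made-up counts over 10 days: one heavy poster, several quiet or inactive users.
rates = posts_per_user_per_day({"a": 300, "b": 30, "c": 10, "d": 5, "e": 0}, days=10)
average = mean(rates.values())
typical = median(rates.values())
print(average, typical)
```

In this invented example the mean is several times the median, so the 3.4 to 3.8 figure should be read as an average over whoever is in the database, not as the habit of a typical user.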