This blog has moved elsewhere!

To make sharing content easier, I’ve decided to move my personal work blog to Tumblr. Thus, The Rice Cooker has now become The Electric Rice Cooker.

Prototype: Completed Buildings in Hong Kong (2005-2011)

Screenshot of completed buildings map in Hong Kong (2005-2011)

The following is a map of completed buildings in Hong Kong from 2005 to 2011, based on Buildings Department data that we processed ourselves (errors may occur). It is still being worked on as we speak, so you might also find bugs.

Using a text processing tool on Linux, we extracted the text of section 5.6 from each of the Buildings Department’s monthly PDF digests (here’s one of the 80-something published between 2005 and 2011).
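To give an idea of what that extraction step can look like, here is a minimal Python sketch. It is not the tool we actually used. It assumes the PDF has already been converted to plain text, and that section headings sit on their own line and start with a dotted number such as "5.6"; both the function name and that heading layout are assumptions made for illustration.

```python
import re

# Assumed heading layout: a dotted section number at the start of a line,
# followed by whitespace (e.g. "5.6 Some heading").
HEADING = re.compile(r"^\s*(\d+\.\d+)\s")

def extract_section(text, number="5.6"):
    """Return the non-empty lines found under the heading `number`,
    stopping at the next dotted heading."""
    collected = []
    inside = False
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if match.group(1) == number:
                inside = True   # start collecting after this heading
                continue
            if inside:
                break           # next section reached
        if inside and line.strip():
            collected.append(line.strip())
    return collected
```

Running this over each monthly digest would give one list of raw lines per month, which still needs cleaning before it can be mapped.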

The data was cleaned with Google Refine to the best of our abilities and mapped with Google Fusion Tables by town planning unit. We couldn’t map by address, because addresses were often messy and those in the New Territories often gave only a lot number.

The Hong Kong government still provides lots of semi-open data (semi-open because it is not originally in a raw text format) that could use a bit of repackaging. We’ll get back to you soon on this.

Ma Ying-jeou in pictures on WeiboScope

When you check out WeiboScope today, what you will notice are Yao Ming with sleeping delegates at the Shanghai CPPCC and Ma Ying-jeou, Taiwan’s president re-elected for a second 4-year term on Saturday (there’s also this weird meme of Zhou Qifeng, Peking University’s president, grinning uncontrollably alongside Li Keqiang, China’s Vice-Premier).


But what really caught my eye were all the photos sitting at the top of our Sina Weibo data stack, which probably runs into the thousands by the time you reach the bottom. The image wall above was generated using the image search portion of WeiboScope. One photo in particular is making the rounds: Ma in the US with his future wife, Zhou Meiqing.

WeiboScope: image search by keyword


You might have heard of WeiboScope for its display of the most important images posted by a sample of users we selected. “So what?”, some users have asked. WeiboScope is a suite of visualisation tools for an archive of Sina Weibo posts that we collect and store in a local database, currently at a rate of roughly 2-3 million posts per week.

But the power of WeiboScope is not this particular visualisation (because there are many of them), but rather the data underneath that sustains it. Rather than let Sina Weibo dictate the way the data produced by users should be displayed, we borrow a bit from the open data movement and repackage posts in ways that may be a bit more useful to users. This is how a WeiboScope search by image came to be.

Consider these current use case scenarios:

1- A non-Chinese reader would like to know what the Chinese Weibosphere is now thinking about the death of Kim Jong-il. They can type the Chinese name of Kim Jong-un in the Sina Weibo search bar and find a list of about 25 weibos. But because they are unable to read Chinese, they can rely only on images. They feel lost, and give up on Weibo (for the day).

2- A person who has a native level of Chinese is doing research on suicide. Some cases reportedly go viral on the Internet, sometimes because they are fake and attention-seeking, and sometimes because their causes provoke deep societal debates. The researcher uses the Sina Weibo search bar and finds some irony and some irrelevant news. It is hard for him to assess the importance of one case relative to others within a given period of time.

Now, consider that we had a sample of all Weibos ever produced and that our search engine is neutral as to what gets shown and what does not.

Scenario 1: Using the image search on WeiboScope, you can now find that one of the most popular images used in posts was this one. But then, by visual elimination, you may also notice some more odd pictures such as this one speculating on the younger Kim’s Christmas activities.

Scenario 2: Using the image search on WeiboScope, the researcher searches the word “suicide”. In March 2011, we tried this with an early version of this tool. We had come across a schoolchildren suicide case in Fujian while browsing the popular image aggregation out of curiosity. At that point, we had seen only one post that reached viral level. We were curious about the impact of this case on the Chinese Internet, so we searched the characters for “suicide” on the search engine. The result? About 80% of the recent posts with the characters for “suicide” were related to the Fujian case.

The WeiboScope image search demonstrates that when you are allowed to mash, mix and remix data, it may lead to discoveries and realizations that would not have been possible otherwise.

(For non-Chinese writers, the engine supports some automated Google Translate translation! For people searching in Chinese characters, please use quotes around your characters.)

Gephi, like “Photoshop” for graphs

I stumbled across Gephi while browsing through the methodology used by Jonathan Stray, for a project he did at AP on Iraq and Afghanistan war logs.

Gephi is a tool developed primarily in France and co-sponsored by SciencesPo’s médialab, a research center strongly inspired by MIT’s own Media Lab.

I tried it out today; it is surprisingly fast and seems to provide what we need to do graph analysis. I think the next step is to produce data in the format that Gephi reads natively, GEXF, which is an XML format. The data is defined as nodes and edges. The various plugins written for Gephi may help you lay out the graph into something interesting for further analysis, and let you annotate it and do things like color nodes based on their distance from each other.
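As a rough idea of what producing GEXF could involve, here is a small Python sketch using only the standard library. It writes a bare-bones GEXF 1.2 document with a list of nodes and a list of edges; the function name and the sample user names are made up for illustration, and a real export would probably add attributes such as gender or follower counts.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"

def build_gexf(nodes, edges):
    """nodes: list of (id, label) pairs; edges: list of (source_id, target_id) pairs.
    Returns the GEXF document as a string."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static", "defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes:
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for edge_id, (source, target) in enumerate(edges):
        ET.SubElement(edges_el, "edge", {
            "id": str(edge_id), "source": str(source), "target": str(target),
        })
    return ET.tostring(gexf, encoding="unicode")

if __name__ == "__main__":
    # Hypothetical example: user_b reposted user_a.
    print(build_gexf([(1, "user_a"), (2, "user_b")], [(1, 2)]))
```

Saving the returned string to a .gexf file should be enough to open it in Gephi, although I have not tested this against every Gephi version.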

The screenshots and videos should help convince you of the usefulness of the tool for your organization. Since the program is written in Java, it is available on all platforms: Windows, Mac and even Linux.

Next step, read the manual

Have the characters “Egypt” disappeared from the Chinese micro-blogosphere?

To recall the news item I am referring to, many news media like the Wall Street Journal and The New York Times reported a week ago that popular microblogging systems such as Sina Weibo had blocked the “Egypt” characters (埃及) from their search engine.

This made the rounds in my networks and got many people excited about the importance of the news. (One mainstream media outlet even confused the blocking of search with the blocking of everything…) I said in a previous post that my intuition was that the ban would have little or no effect, simply because microblog users (Twitter included) use the search function only sporadically and trust the people they follow.

Sitting on an archive of weibos (microblog posts) from Sina, I decided to do a quantitative survey of the actual chatter happening around Egypt on China’s most popular microblogging service. My only disclaimer is that while our sample size is large, it is also skewed towards Hong Kong users, who are mostly ordinary citizens.

We followed about 100,000 users in Hong Kong based on a search by province (the location on Sina Weibo is coded). Another 40,000 consisted of users with 1,000 or more followers from all across the network, whom we gathered through popularity search (most followers in each city/province) or incidentally when researching single posts of interest. So, this gives us a total of at most 140,000 users whose posts we have followed and downloaded.

I also waited a week after the search ban before compiling results because I wanted to have the most complete archive of posts (it takes about 2-4 days to crawl through the entire list of users we follow). As of today, I can now say that our archive is complete up to last Thursday, February 3rd, 2011.

Overall occurrence of the term “Egypt”

My first piece of data tracks the evolution in the number of posts, reposts and distinct users talking about the specific characters “埃及” (Egypt), along with that of “突尼斯” (Tunisia):

For simplicity, only the total of posts+reposts is shown in the following preview. Clicking on the image gets you the full picture.
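For the curious, the counting behind this kind of chart can be sketched in a few lines of Python. This is only an illustration and not our actual database query: it assumes each archived post is a dictionary, and the keys 'date', 'user_id', 'text' and 'is_repost' are hypothetical names.

```python
from collections import defaultdict

def daily_term_counts(posts, term):
    """Count, per day, the original posts, reposts and distinct users
    whose text contains `term` (a plain substring match)."""
    counts = defaultdict(lambda: {"posts": 0, "reposts": 0, "users": set()})
    for post in posts:
        if term not in post["text"]:
            continue
        day = counts[post["date"]]
        day["reposts" if post["is_repost"] else "posts"] += 1
        day["users"].add(post["user_id"])
    # Replace each set of users by its size so the result is easy to tabulate.
    return {
        date: {
            "posts": day["posts"],
            "reposts": day["reposts"],
            "distinct_users": len(day["users"]),
        }
        for date, day in sorted(counts.items())
    }
```

Calling it once with “埃及” and once with “突尼斯” gives the two series to compare.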

What were the most reposted stories with the characters for “Egypt”? The answer lies here

This is just to understand what type of chatter we are talking about when it comes to “Egypt”. I did a compilation of the most popular reposts since January 23rd, 2011, when the number of posts on Egypt started to rise on Weibo (at 100+/day from our sample, compared with a baseline of 10-30).

Here are some highlights:

1. [478 reposts] 2/2/2011 1:29:51 中國人這次在埃及機場的確表現的很出色,大使館給滯留的公民都送去了盒飯,裏面的配菜很簡單就是雞蛋,土豆,粉絲加米飯,尤其是剛回國的這些孩子們都把雞蛋省了下來送給機場那些饑腸轆轆的外國人吃。香港媒體評論:這是最好的國家形象片,政府救助及時,公民懂得分享,留給他們的除了感謝還有嫉妒@人傑 (login required) This seems to be a story of Hong Kong citizens stranded in Egypt.

2. [181 reposts] 2/3/2011 0:26:20 半岛台消息:据现场人员报告,埃及博物馆就要被烧毁了。此时此刻,地球上两个地方爆炸声震天,一处在欢庆,一处在怒吼。 (login required) From a netizen named Bob Wang, with a small following (just shy of 600 followers at this moment), who talks about the looting of the Egyptian Museum in Cairo. He also draws a parallel between the two places in the world that were loudest with explosions at midnight on Feb 3: Egypt (because of the uprising) and China (because of the Lunar New Year).

3. [169 reposts] 1/29/2011 10:36:12 埃及开罗骚乱暂平息,坦克开进广场士兵拒绝开枪:1月29日凌晨,在埃及首都开罗,示威者爬上装甲车。 埃及军队的数十辆装甲车与坦克于29日凌晨开入示威者占据的解放广场,士兵表示不开枪,游行示威者也表达了和平意愿,骚乱暂时平息。(新华社) From HK-based Phoenix TV. In Google Translate: Temporarily quell riots in Cairo, Egypt, the soldiers refused to shoot tanks rolled into squares: 29 January morning, in the Egyptian capital Cairo, protesters climbed the armored vehicles. Egyptian army armored vehicles and dozens of tanks in the open into the early morning of 29 demonstrators occupied the Liberation Square, the soldiers said they did not shoot demonstrators also expressed their peaceful intentions, riots temporarily subsided. (Xinhua News Agency)

12. [100 reposts] 2/3/2011 13:26:47 埃及人民舉起中文標語。 (login required) [image] The author is Beifeng, aka Wen Yunchao, a mainland blogger, rights activist and CMP fellow.

N/A. [total 1990 reposts] 1/28/2011 22:28:41 埃及的互联网流量。新闻: (login required) A post by Internet entrepreneur Lee Kai-fu, one of the top 10 most popular users across the Sina network. He talks about the Internet shutdown in Egypt. I removed it from my list because we specifically targeted this post for further study, which introduced a bias in the repost count via our archiving system.

Number of original posts with the characters for “Egypt”

This is one of the potentially interesting figures that could mean anything until we find a baseline to compare (probably Twitter but just looking at Chinese users). As shown on the graph and in the previous one that took into account reposts, the number of original posts decreased markedly after the January 29th peak, that is the day before the ban on search was reported to be activated.

It would be interesting to see the effect with a more systematic or curated sample of users from all regions of China and around the world. Comparing with other issues and the popularity of a wide range of topics (between “forbidden topics” and more innocuous stuff) would also allow us further insights on this sudden rise and fall in interest. (And it wouldn’t really be called a fall, since we still sustain an average of above 2,000 posts and reposts with the characters for “Egypt”, out of about 120,000 posts per day in the sample of users outside Hong Kong.)

Starting to graph the social networks

Pattern of reposting of a post on the Qian Yunhui incident

My boss, Dr. King-wa Fu, wrote code in the R programming language using graph analysis packages (see the Fruchterman-Reingold layout algorithm) and generated the graph seen here. This graph retraces the reposting of a particular post on the Qian Yunhui incident/murder in late December 2010 (see one of the reposts; the original was deleted; login required).

The red is for female reposters, and the blue is for male ones. We traced the reposting pattern using the references to previous users contained in the reposting text (in Sina, the original post is preserved in its entirety and cannot be modified). While not terribly interesting in terms of colouring (we kept the more interesting ones for more official publications), the graph does help visualise how posts are shared on a social network like Sina Weibo.
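To illustrate the tracing idea, here is a simplified Python sketch (our actual code is in R and is not reproduced here). It assumes that each intermediary appears in the repost text as "//@name:" with the most recent intermediary first; the function name and the sample user names are hypothetical.

```python
import re

# Assumed marker: "//@" followed by a user name and an ASCII or full-width colon.
MENTION = re.compile(r"//@([^:：\s/]+)\s*[:：]")

def repost_path(original_author, reposter, repost_text):
    """Return the edges (from_user, to_user) along which the post travelled."""
    intermediaries = MENTION.findall(repost_text)
    # The text lists the latest intermediary first, so reverse it to get
    # the order in which the post was actually passed along.
    chain = [original_author] + intermediaries[::-1] + [reposter]
    return list(zip(chain, chain[1:]))
```

For example, a repost by user C reading "agree //@B: wow //@A: look" of a post by user O yields the edges O to A, A to B and B to C. Collecting these edges over all reposts gives the graph to lay out.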

(We dream that one day such a graph could be generated in real time and that you could zoom in and out. I’ve seen rather similar products in a previous life when I was working in bioinformatics and playing with 3-D modelling tools and complex molecular model databases.)

On a different topic, with data not published here, we clearly saw that celebrities (people with many followers) play a certain role in driving content sharing. Many users read those celebrities’ weibos and in turn repost them to their networks.

Hopefully, we’ll have more of these to share in the next few weeks…

The (supposed) gender imbalance on the Chinese Weibosphere (updated)

Gender divide on Sina Weibo in Hong Kong

In November, our Hong Kong sample had a larger majority of women than the mainland sample. The current data shows 72,753 women to 27,871 men (about 72% women), which is, give or take one percent, the same proportion. I was starting to suspect that this imbalance might be linked to our sampling method, which is based on a sample of posts for which we grab the users: if women are more frequent updaters, they should appear more frequently, no matter their real distribution in the Weibo population. However, some data that I found may indicate that it is not an accident.

An analysis method that we developed in the past week seems to support this gender imbalance, to my great surprise. I now have a bunch of new scripts that can gather all the reposts of a single post (in Weibo, all reposts/retweets are posts of their own, with a reference to the original) and gather basic information on users. With previous methods, we only had a cross-section of the user base, but this time we have a clearly defined and complete sample of users. We compiled the data for a popular non-entertainment post by a Hong Kong activist on the Choi Yuen Tsuen (菜園村) issue (login required) in Hong Kong, and found that it was reposted by 1,050 women and 293 men (in Weibo, the gender profile attribute is mandatory and exposed by the API). And 1050 / (1050 + 293) is… 0.782, which means that 78.2% of the reposters are women.

In the entire sample, Hong Kong and other territories combined, the numbers are 6,211 women and 1,814 men, which is 77.4% women [strangely enough for a Hong Kong issue, the majority of reposters were mainland-based, with the most people from Guangzhou (1512) and Shenzhen (411)]. To give more weight to these observations on gender, I’d have to look at a larger variety of posts, and actually verify the validity of the gender identification. It would be interesting to see whether these same users who repost are actually people who post a lot of original things, and if they are followed by a comparable number of users. Do women make up most of the user base of Sina Weibo? Does the repost pattern reflect their style of contributions?
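For anyone who wants to redo the arithmetic, the proportions quoted above can be checked with a few lines of Python. The helper name is only for illustration; the counts are the ones given in the text.

```python
def share_of_women(women, men):
    """Proportion of women among users whose gender is known."""
    return women / (women + men)

# Counts quoted above for reposters of the Choi Yuen Tsuen post.
reposter_counts = {
    "Hong Kong reposters": (1050, 293),
    "All reposters": (6211, 1814),
}

if __name__ == "__main__":
    for label, (women, men) in reposter_counts.items():
        print(f"{label}: {share_of_women(women, men):.1%} women")
```

This says nothing about whether the self-declared gender is accurate, which is the part that still needs verifying.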

Is it me or is everyone else sick too? Demonstration with Sina Weibo data

View the chart and data tables (may not appear if syndicated on Facebook)

I have a slight sore throat, but nothing to worry about. But my colleagues have been coughing and wheezing most of this week, and I was coincidentally talking about sentiment analysis with my boss yesterday… What makes it easy on Sina Weibo is that emoticons have been codified, such that a happy face is now [呵呵] (“hehe”) and the crying face is a [淚] (crying) [see the Weibo API file]. I ran a database query that summed the number of tweets found containing the sick face [生病] (being ill) emoticon and got the previous graph.
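The query itself is nothing fancy. Here is a sketch of the idea using Python and SQLite; our real database and schema differ, so the table name "posts" and the columns "user_id", "created_at" and "text" are assumptions made for the example.

```python
import sqlite3

def sick_posts_per_day(conn, emoticon="[生病]"):
    """Return (day, number of posts, number of distinct users) for posts
    whose text contains the given emoticon code."""
    query = """
        SELECT date(created_at) AS day,
               COUNT(*) AS posts,
               COUNT(DISTINCT user_id) AS users
        FROM posts
        WHERE instr(text, ?) > 0
        GROUP BY day
        ORDER BY day
    """
    return conn.execute(query, (emoticon,)).fetchall()
```

Because the emoticons are codified as fixed strings, a plain substring match is enough; no sentiment model is involved at this stage.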

I divided the total number of posts and of distinct users by 100 for the sake of visibility. I didn’t plot the proportion of users who used the sick emoticon versus the total, for the sake of clarity, but the raw data is available here.

The blue/red line is for people who posted the sickness emoticon; it shows a steady increase relative to the total number of posts in the week following the new year. There’s also a strange peak on December 16th, but we think it may be sickness at the thought of studying for final exams. 😛

Sina Weibo and Twitter: Comparing data and conversation structures

Launched in July 2006, Twitter is an online social network that popularized the concept of “microblogs”. Conciseness was defined at 140 characters by the founders of Twitter and this was chosen as a practical way to update one’s Twitter account from a mobile phone, using short message service (SMS).

In terms of characters, conciseness does not have the same meaning in Western alphabet-based languages as it does in Far Eastern ones like Chinese. A Chinese ideogram may be analogous to a word, and an amalgam of four of them may often contain as much meaning as a sentence in a European language. The phenomenon of microblogs in China is transforming the landscape of information diffusion in the country, as seen for instance with the Yihuang (宜黄) self-immolation case in September 2010.

Figure 1. Former Google China chief and Internet entrepreneur Kai-fu Lee is one of the ten most popular “Weiboers” in mainland China with 2,472,508 followers in December 2010.

Chinese Twitters

In the midst of the July 2009 Xinjiang unrest, the Chinese government decided to block access to Twitter and Fanfou, then the leading Twitter clone, which has recently been re-established in mainland China. Internet giants Sina and Tencent (QQ) immediately filled the void and started offering microblog services to their millions of users in mainland China. In November 2010, there were already over 40 million users signed up to the Sina Weibo microblogging service.

Sina Weibo focused its strategy on becoming a text-based broadcast entertainment medium, offering exclusive content from celebrity microbloggers from across the sinosphere.

Basic elements of Twitter and Sina Weibo and the effects on conversation

“Atom” comes from the Greek atomos, which means uncuttable, or indivisible. In the study of microblogs, the atom is the microblog entry, commonly called a “tweet” in Twitter. It is a message that can serve one of three different functions: it may be an ordinary message (tweet), a repetition of another user’s tweet (retweet) or a reply to another user’s tweet.

Instead of using “RT” to signal a retweet, the Sina Weibo user writes “//”, followed by the retweeted user’s name. Behind these vocabulary trivialities, the structure of the group conversation is in fact dramatically different in practice. Users of the American microblogging service often deviate from the adopted syntax (by using “via @somebody”) or employ legacy Twitter clients that may not appropriately mark an entry as a retweet. Sina Weibo, by contrast, makes a point of preserving original postings.

On Sina Weibo’s official interfaces (Web and mobile), the equivalent of a Twitter retweet is instead shown as two amalgamated entries: the original entry and the current user’s actual entry, which is a commentary on the original entry, often with a mention of the user’s sources if the original entry was obtained through an intermediary.

Figure 2. This is a status update made by Taiwanese celebrity user Dees Hsu and shown on her user timeline. In this example from Sina Weibo, the user quotes an entry by singer A-Mei Chang (Note 1), who herself retweeted an entry originally written by Liu Hanya (Note 2). The two are distinct Weibo entries, displayed as an amalgamated entry on Dees Hsu’s user timeline, along with the comments on and the number of reposts of Dees’ repost (Note 3). The subtlety of this conversation may be lost on Twitter, where the different parts have to be retrieved from several user timelines.

Sina also borrows from its experience as one of China’s biggest blogging service providers by introducing a crucial data structure absent from Twitter: the comment.

Twitter conversation is achieved with combinations of hashtags and search, or with successive replies. On Sina Weibo, these methods are also present, but comments add a familiarity and structure to microblog conversation that is not naturally found on Twitter.

Putting a user entry side by side with the original entry also has the effect of centralizing conversation on one original entry. With Sina’s focus on celebrity users, these VIP users’ timelines may often resemble a television variety show where acquainted characters speak to and mention each other directly, while the masses write in the comments section and repost to their networks.

Pictures posted on Weibo are also hosted directly by Sina and linked as an entry property, rather than taking up space in the text field. Combined with the fact that they are written in Chinese, entries on Weibo evidently convey a lot more information within a single entry capped at 140 characters.

Comparing the two networks, Twitter’s model naturally suffers from having been first on the market. While keeping the underlying data structure intact, it may be possible to tweak the user interfaces to change the perception of conversation, which is what some third-party visualisations like Moritz Stefaner’s Revisit attempt to do.

The generic nature of the tweet is Twitter’s most defining characteristic and has shaped that social medium’s chaotic but multi-directional conversation, in contrast with Sina Weibo’s tendency towards centralization.

Of lists and tags

To group users, the two networks have chosen different strategies. Twitter has lists that a user creates to classify the people he reads into “reading lists”, which may then be shared. Sina Weibo instead has a system of self-assigned tags.

Figure 3. Tags let users classify themselves by interests, age groups, origin, etc.

Based on the names of the lists and the co-membership of lists across the network (across different users, that is), Twitter can guess which users are considered to share something in common. The system employed by Sina Weibo groups users by interests, which allows it to suggest other users with similar or compatible interests to follow. Combined with the set of entries reposted by a Sina Weibo user, it becomes increasingly easy to categorize people into personas.

An outlook on visualisations

Sina Weibo’s more rigid structure for conversation means cleaner and higher-quality data, which is a blessing for future visualisations. Unlike Twitter, Weibo has also codified the geographic location of its users: on signup, a user must select his province and city (or foreign country only, if living abroad) of residence from a finite list. We could naturally think of a very precise map where the circulation of information is modelled.

Maps detailing social networks may also be drawn using information derived from Sina Weibo’s API. Follower and friend information is an obvious piece of data to consider, but relationships may also be quantified by the strength of conversation linking pairs of users, through their number of mutually shared entries, comments and mentions on each other’s timeline.
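One simple way to quantify such a relationship is sketched below in Python. It is only an illustration under assumptions: each interaction is taken to be a (from_user, to_user, kind) tuple already extracted from the API data, every kind of interaction counts equally, and the function name is made up.

```python
from collections import Counter

def tie_strength(interactions):
    """Count interactions between each unordered pair of users.
    `interactions` is an iterable of (from_user, to_user, kind) tuples,
    where kind might be 'repost', 'comment' or 'mention'."""
    weights = Counter()
    for source, target, _kind in interactions:
        if source != target:  # ignore users interacting with themselves
            weights[frozenset((source, target))] += 1
    return weights
```

The resulting counts could serve as edge weights on a map of the network; weighting comments differently from reposts would be an easy variation.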

Sets of similar users by tags and entries reposted, combined with geographical location, can also be of great value to Sina and its commercial partners. From a media studies perspective, it would be invaluable to observe the patterns of reposting and characterize types of individuals instrumental to the propagation of news across the Internet.

In the future, we envisage interfaces for online social networks that let us navigate social interactions more seamlessly, making it easier to view, discover and assess the relevance of a piece of information based on the number of references to it throughout the social network.

This article was written for the Information Architecture and Visualisation course at the Hong Kong Polytechnic University School of Design where the author is pursuing a graduate degree in interaction design. He also works full-time just across Victoria Harbour as a technical researcher in digital media at the Journalism and Media Studies Centre at the University of Hong Kong.

A version of this article appeared on the China Media Project website