Twitterball and Twitterbubble to visualise Twitter data from China using processing.js

I created two visualisations of our Twitter data from China, Twitterball and Twitterbubble. Both are made using processing.js.


Pulling data from Sina Weibo API + the Yihuang self-immolation story

A week or so ago, our director at the JMSC, Prof. Ying Chan, wrote a piece on the ever-changing media landscape in China:

Microblogs, which are limited to 140 characters in length, can be sent from mobile phones or computers. Twitter, the original microblog service, has been blocked in China, but major websites have launched their own Twitter clones, and these have become an important alternative channel for information. It is interesting to note as well that 140 characters in Chinese actually makes for much richer content than the same in English.

(More on the China Media Project website)

To push it a little further, we’ve decided to pull data from Sina Weibo, China’s Twitter equivalent. It may be dubbed a Twitter clone, but like a lot of online Chinese services that I’ve learned to use in the past year living here, it comes with “Chinese characteristics” (other than blatant censorship, mind you), such as blog-like commenting.

One thing that seems different from a developer’s point of view: there is no “immortal” consumer token (so I can’t pull data through OAuth the way Twitter requires), but instead a reliance on basic authentication (so to pull data, I supply a username and password in plain text through a command-line tool like curl). The Sina Weibo API is practically the same as Twitter’s, plus or minus a couple of functions. I’m going to post my data-collection code on our JMSC GitHub when I find some time.
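For reference, here is a minimal sketch of the kind of call involved, using Python over basic auth. The host and endpoint names (api.t.sina.com.cn, statuses/user_timeline) are my assumptions about the 2010-era Sina Weibo REST API, not code taken from the tools mentioned below, and that old basic-auth API has long since been retired.

```python
# Minimal sketch: pull a user's timeline from Sina Weibo over basic auth.
# Host and endpoint are assumptions about the 2010-era API.
import requests

API_BASE = "http://api.t.sina.com.cn"  # assumed 2010-era API host


def fetch_user_timeline(username, password, user_id, count=200):
    resp = requests.get(
        f"{API_BASE}/statuses/user_timeline.json",
        params={"user_id": user_id, "count": count},
        auth=(username, password),  # plain basic auth, as described above
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # list of status dicts, Twitter-like schema


# Usage (credentials and user_id are placeholders):
# statuses = fetch_user_timeline("me@example.com", "secret", 1819775930)
```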

(Edit (2010-10-26): As promised, I posted source code for data collection tools on the Sina Weibo network.)

The Sina Weibo API provides much more granular, predictable and decodable location data. For instance, this is my personal info: userinfo-cedricsam.json. My province is “81” and my city is “1”. What does that mean? The answer is in GB 2260 (2007), the Chinese standard for administrative divisions. I’m not sure whether 81 is the code for Hong Kong in GB 2260; in ISO 3166-2, Hong Kong is in fact “CN-91”. Nonetheless, the coding is totally consistent for locations in Mainland China (Taiwan is “CN-71” in ISO 3166-2). For instance, if you are living in Guangzhou City (广州市), then your data from the Sina Weibo API comes out as province=44 and city=1, the GB 2260 code for Guangzhou City being 440100…
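To make that concrete, here is a toy sketch of composing a GB 2260-style prefecture code from the two Sina Weibo fields, following the pattern above (44 and 1 giving 440100). The lookup table is a tiny illustrative subset I filled in myself, not the full standard.

```python
# Compose a 6-digit GB 2260-like code from Sina Weibo's province/city fields.
# The lookup table below is only an illustrative subset of the standard.
GB2260_PREFECTURES = {
    "440100": "Guangzhou, Guangdong",
    "110100": "Beijing (municipal districts)",
    "310100": "Shanghai (municipal districts)",
}


def gb2260_code(province, city):
    """province=44, city=1 -> '440100' (Guangzhou City)."""
    return f"{int(province):02d}{int(city):02d}00"


code = gb2260_code(44, 1)  # -> "440100"
print(code, GB2260_PREFECTURES.get(code, "unknown"))
```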

The Sina Weibo API also lets you download up to the last 9999 followers or friends of any given user. So, using at least the provincial column, I was able to geocode on a Google Chart map (cht=map), with nothing more than ISO 3166-2 / GB 2260 provincial codes, all the followers of Ms. Zhong Rujiu (钟如九), of the family at the centre of the Yihuang (宜黄) self-immolations, carried out in protest against the forced requisition of their land. (It’s worth noting that this story has not been censored, as far as I know.)

The result is this map:

Sina Weibo – Zhong Rujiu’s last 9999 followers geolocation
URL: http://jmsc.no-ip.org/social/sinaweibo.py/chart?user_id=1819775930

It is not very interesting, notably because of the 9999 limitation. The time at which a friendship/followership began is not indicated either, but its order is implied in the friends/followers data you receive (I could have stored this data, but chose not to at this point).

By a large margin, most users lived in Guangdong (~1400). Second came residents of Beijing (~900). Why? Is it just because of market penetration by Sina in Guangdong?

You might notice that it is a Python script: it acts as a proxy and wrapper for the Google Chart API, since I build the chart from data pulled from my local database rather than in real time.
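As a rough illustration of what that proxy does, here is a sketch under assumed table and column names (a followers table with province and followed_id columns — not my actual schema); the Google Chart map parameters are written from memory of the Image Charts documentation and may need double-checking.

```python
# Sketch: count followers per province in a local database, then build a
# Google Chart API map URL (cht=map) coloured by ISO 3166-2 region codes.
# Table/column names are placeholders; chart parameters are from memory.
import sqlite3
from urllib.parse import urlencode


def follower_map_url(db_path, followed_id):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT province, COUNT(*) FROM followers "
        "WHERE followed_id = ? GROUP BY province",
        (followed_id,),
    ).fetchall()
    conn.close()

    regions = [f"CN-{int(p):02d}" for p, _ in rows]  # e.g. 44 -> CN-44
    counts = [c for _, c in rows]
    top = max(counts, default=1)

    params = {
        "cht": "map",                                # map chart type
        "chs": "440x220",                            # chart size
        "chld": "|".join(regions),                   # regions to colour
        "chd": "t:" + ",".join(str(round(100.0 * c / top)) for c in counts),
        "chco": "CCCCCC,DDEEFF,0000CC",              # default colour + gradient
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)


# print(follower_map_url("sinaweibo.db", 1819775930))
```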

So, from here, we know that we have interesting tools and that the Sina Weibo API is in fact as open and accessible as the Twitter API. Because it still relies on basic authentication (for the moment, we assume), the programming is also a lot simpler to deal with.


Evolution of #manila823 and #manilahostage hashtags on Twitter


Like a lot of people here today, I’ve been swept up in the tragedy in Manila involving Hong Kong tourists.

I understood that something was happening the second time I saw people standing in front of televisions in a consumer electronics store in the commercial district of Causeway Bay. But I only “got” the news when I went to read my Twitter feed (@cedricsam) and started following confusing reports of reports, each fighting for retweets, and thus relevance.

The graph above simply shows the evolution of the two hashtags, as the number of tweets per two-minute interval. I had a script prepared in advance, previously used to track the Twittersphere’s reaction to black rainstorm alerts in Hong Kong.

The story by Bloomberg describes a “10-hour standoff”, but my data only goes back a maximum of 500 tweets per search (a limitation of the Twitter Search API).
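For the curious, here is a hedged sketch of the collection step against that old Search API (search.twitter.com, since retired). The rpp and page parameters reflect that API as I remember it, so treat the details as assumptions rather than a copy of my script.

```python
# Sketch: page through the 2010-era Twitter Search API for a hashtag.
# Endpoint and parameters (rpp, page) belong to the retired v1 Search API.
import requests

SEARCH_URL = "http://search.twitter.com/search.json"  # retired endpoint


def collect_hashtag(tag, max_pages=15):
    tweets = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            SEARCH_URL,
            params={"q": f"#{tag}", "rpp": 100, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break  # no more results available
        tweets.extend(batch)
    return tweets  # each result has created_at, from_user, text, ...


# manila823 = collect_hashtag("manila823")
```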

In the case of #manila823, a popular hashtag among Chinese tweeters, that maximum was reached, meaning that more tweets could potentially be found before 20:00 HKT. A Twitter user based in Guangzhou, @Doriscafe, seemed to be leading the pack.

For #manilahostage, I retrieved about 300 tweets containing the hashtag, which came up a few times on my English-language Twitter feed.

(The peaks at about 20:45 local time occurred when the gunman was announced dead and when the last hostages were starting to be evacuated from the bus.)

If Twitter data were more searchable further back in time (or if we were more systematic about sampling Twitter searches during important events), we could do more to visualise what’s buzzing at any given moment.


And now the readers of @CCHK

After the graphic showing the breakdown by time zone for @cmphku readers, here’s another one showing it for @CCHK readers.

CCHK is Creative Commons Hong Kong, and we at the JMSC are hosting Creative Commons for the HKSAR. The Twitter account was registered in October 2008 and was, until the advent of @cmphku, the most closely followed at the JMSC.

Notice the high proportion of readers from the Alaska time zone (which does not include Hawaii). From manual inspection of a sample of screen names, most of these users may well not be from Alaska… In fact, it seems that most are based in China, although I don’t have a precise number or proportion to offer.


China Media Project on Twitter: @cmphku followers around the globe

I uploaded compiled data on our @cmphku followers to Google Fusion Tables (only the aggregate is shown here, for privacy reasons), and here’s what it shows. The location data is based on the time_zone column returned by the Twitter API’s users/show query.
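For the record, here is a minimal sketch of that aggregation, assuming each follower’s profile (as returned by users/show) has already been fetched and saved to disk as a JSON file; the directory layout is hypothetical.

```python
# Sketch: count followers per Twitter time_zone from stored profile JSON.
# The on-disk layout (one JSON file per follower) is an assumption.
import glob
import json
from collections import Counter


def timezone_breakdown(profile_dir):
    counts = Counter()
    for path in glob.glob(f"{profile_dir}/*.json"):
        with open(path) as fh:
            user = json.load(fh)
        counts[user.get("time_zone") or "unknown"] += 1
    return counts


# for tz, n in timezone_breakdown("followers/cmphku").most_common():
#     print(tz, n)
```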


Visualising HK Transport Department traffic accident data in Google Fusion Tables

Screenshot: Transport Department website – Year 2008
Step One: Download the data from the Transport Department website at http://www.td.gov.hk/en/road_safety/road_traffic_accident_statistics/2008/index.html. Scroll down and you will find a link to Road Traffic Accident Database 2008.

Screenshot: Google Fusion Tables
Step Two: Import into Google Fusion Tables. You have to save the XLS file as individual CSVs, since Fusion Tables only takes one table at a time, and the row limit is lower for XLS files.
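If you would rather script that split than do it by hand, something like the following works, assuming pandas (with xlrd for old .xls files) is installed; the workbook filename is a placeholder.

```python
# Sketch: split the Transport Department workbook into one CSV per sheet,
# since Fusion Tables imports a single table at a time.
# The filename is a placeholder; pandas needs xlrd to read old .xls files.
import pandas as pd

sheets = pd.read_excel("road_traffic_accidents_2008.xls", sheet_name=None)
for name, frame in sheets.items():
    frame.to_csv(f"{name}.csv", index=False)
    print(f"wrote {name}.csv ({len(frame)} rows)")
```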

Screenshot-Google Fusion Tables | Vehicles involved in Road Traffic Accidents in 2008 (Hong Kong) - Google Chrome
Step Three: Visualise. Here, we see that an overwhelming proportion of road casualties in 2008 involved men (coded as 1 in the data), though that might simply reflect demographics.

Because there is not a lot of unique information to plot (like a datetime for each accident), the suggestion with this data is to aggregate on your column of interest (say, driver sex), plot that column as the entity, and use the count as your value. It could also be nice to mix and match two criteria (are young men more frequently involved in accidents?).
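Outside Fusion Tables, the same aggregate is only a couple of lines in pandas; the column names (“Sex”, “Age”) and the CSV filename below are my guesses at the headers, which may differ in the actual Transport Department file.

```python
# Sketch: aggregate-then-plot, pandas version. Column names are guesses
# at the CSV headers produced in Step Two.
import pandas as pd

casualties = pd.read_csv("casualties_2008.csv")

# One criterion: casualties per sex -- the entity/value pair you would chart.
by_sex = casualties.groupby("Sex").size()
print(by_sex)

# Two criteria mixed: cross sex with a coarse age band
# (are young men over-represented?).
casualties["AgeBand"] = pd.cut(casualties["Age"], bins=[0, 25, 45, 65, 120])
print(casualties.groupby(["Sex", "AgeBand"]).size().unstack("Sex"))
```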

If you want to play with the data yourself, here are the links to the tables, as imported in Google Fusion Tables:

1. Road Traffic Accident Stats in 2008: http://tables.googlelabs.com/DataSource?dsrcid=224727

2. Vehicles involved in Road Traffic Accidents in 2008: http://tables.googlelabs.com/DataSource?dsrcid=225310

3. Casualties in Road Traffic Accidents in 2008: http://tables.googlelabs.com/DataSource?dsrcid=225311

Here is how driver casualties compare in terms of age, depending on whether the casualty was male or female (note that the scale is different, being much lower for women).


Male driver casualties in 2008 (plotted by age on the x-axis)


Female driver casualties in 2008 (plotted by age on the x-axis)


Overall driver casualties in 2008 (plotted by age on the x-axis)

The current problem with Google Fusion Tables (which is still a Labs product) is that it won’t let you compare more than two criteria at the same time in a practical format. For instance, I can’t superimpose graphs of deaths per sex and per age in one single view. It sounds like a pretty basic feature, so I wouldn’t be terribly surprised if it showed up in a couple of months, if not weeks.

The quality of the data is also questionable, since some 60-70 people listed as “drivers” are aged 16 or less… Did they mean they were in the driver’s seat, or actually driving, when the accident occurred?!

***

On another note, I also imported the news agencies database from China’s General Administration of Press and Publication (GAPP), the state agency regulating news and print publication in the PRC. This data was retrieved around March 2010 from www.gapp.gov.cn using custom scripts that systematically read the GAPP’s webpages. After parsing it into a database-friendly format, I used it to build the China Media Map, which might start to include our annotations soon.

But frankly, there isn’t much to visualise with this data, aside from location, since it has no contextual data attached to it (it’s just an address/phone book, basically). If you can think of something to do with it, drop me a line.


Comparing the response on Twitter to the July 22 and July 28 black rainstorm alerts in Hong Kong

Last week, I did a quick graph of the number of tweets after a black rainstorm alert. I repeated the procedure for the second black rainstorm alert called by the Hong Kong Observatory within a week. The tweets are grouped into five-minute intervals and plotted over a period of about four hours around the time the alert is announced by the HKO.

Number of tweets found in searches for “black rainstorm” (vertical axis maximum is 30 tweets)

July 28th black rainstorm alert (at 3:35 PM)

July 22nd black rainstorm alert (at 5:30 PM)

Of course, the graphs could mean a lot of things… One of my hypotheses (and perhaps the most obvious answer) is that the first rainstorm of the season is always more captivating and tweet-worthy than the second, let alone the third or fourth.

According to the HKO, a black rainstorm alert means: “Very heavy rain has fallen or is expected to fall generally over Hong Kong, exceeding 70 millimetres in an hour, and is likely to continue.”


Yesterday’s black rainstorm alert in Hong Kong on Twitter

Tweets containing “Black rainstorm”

A simple graph showing the number of tweets with search key BLACK RAINSTORM after the black rainstorm alert in Hong Kong (5-minute intervals)
At 5-minute intervals

A simple graph showing the number of tweets with search key BLACK RAINSTORM after the black rainstorm alert in Hong Kong (2-minute intervals)
At 2-minute intervals

This is the evolution of the words “black rainstorm” on Twitter following yesterday’s black rainstorm alert issued by the Hong Kong Observatory, something that occurs a handful of times per typhoon season. I got the data from the Twitter Search API, parsed it with a Python script and made a graph using the Google Chart API.

It was done in a hurry, but I will try to generalise this method (for future events) when I have the time.
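In the meantime, here is a rough sketch of what a generalised version could look like: bin the Search API results into five-minute intervals around the alert time and hand the counts to the Google Chart API. The created_at format matches what the old Search API returned; the function names, defaults and the UTC alert time in the usage comment are my own illustrative choices, not code from the hurried script.

```python
# Sketch: bucket tweets into 5-minute bins around an alert time, then build
# a Google Chart API line-chart URL from the counts.
from collections import Counter
from datetime import datetime
from urllib.parse import urlencode


def bin_tweets(tweets, start, hours=4, minutes_per_bin=5):
    """Count tweets per bin over `hours` hours starting at `start` (UTC)."""
    bins = Counter()
    for t in tweets:
        created = datetime.strptime(t["created_at"],
                                    "%a, %d %b %Y %H:%M:%S +0000")
        offset = (created - start).total_seconds() / 60
        if 0 <= offset < hours * 60:
            bins[int(offset // minutes_per_bin)] += 1
    n_bins = hours * 60 // minutes_per_bin
    return [bins.get(i, 0) for i in range(n_bins)]


def line_chart_url(counts, y_max=30):
    params = {
        "cht": "lc",                                   # line chart
        "chs": "600x200",                              # chart size
        "chd": "t:" + ",".join(str(c) for c in counts),
        "chds": f"0,{y_max}",                          # data scaling
        "chxt": "y",                                   # show y axis
        "chxr": f"0,0,{y_max}",                        # y axis range
    }
    return "https://chart.googleapis.com/chart?" + urlencode(params)


# counts = bin_tweets(collected_tweets,
#                     start=datetime(2010, 7, 22, 9, 30))  # 17:30 HKT in UTC
# print(line_chart_url(counts))
```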


Traffic accidents in Hong Kong in 2009 (a few charts)

This is a project that I am currently working on with Masato Kajimoto, a faculty member at the JMSC. A few months ago, we contacted the Hong Kong Transport Department to get some data on road safety in the city.

I originally wanted to get location data on every single accident in Hong Kong, but what we did manage to obtain was all the data for 2009. It is non-personalised data, so all it contains is a location, the date-time and the severity (on a scale of 1 to 3).

While the TD had the location data for every single accident (fatal or not), it came directly from police reports. Without a standardised way of naming the place of an accident, it has been a nightmare to try to match a place name with actual coordinates using conventional methods. You get places like “Fung Tak Road at the junction of Ying Fung Lane” or the slightly better “Texaco Road at the junction of Tsuen Wing Street TW New Territories”.

(In fact, the location of an accident was often referenced using only a “chainage” or lamppost identification… Masato asked the Highways Department for geo-coded lamppost information and actually obtained it.)

But the easiest visualisations I have been able to make so far have simply come from aggregating the data over regular time intervals, whether to look at the variation per week (more accidents during weeks with public holidays) or at sums per day of the week and time of day (more accidents late at night, as the weekend approaches or during the weekend).

(Technical note: I wrote a Python script that reads my database and generates links for the Google Chart API.)
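For instance, the day-of-week/time-of-day aggregation behind the last chart can be done entirely in SQL; the table and column names below (accidents, occurred_at) are placeholders, not the actual schema of my database.

```python
# Sketch: aggregate 2009 accidents per day of week and hour of day with
# SQLite. Table/column names are placeholders; occurred_at is assumed to be
# an ISO-formatted datetime string.
import sqlite3

conn = sqlite3.connect("hk_accidents_2009.db")
rows = conn.execute(
    """
    SELECT strftime('%w', occurred_at) AS day_of_week,  -- 0=Sunday .. 6=Saturday
           strftime('%H', occurred_at) AS hour_of_day,
           COUNT(*) AS n
    FROM accidents
    GROUP BY day_of_week, hour_of_day
    ORDER BY day_of_week, hour_of_day
    """
).fetchall()
conn.close()

# Arrange as a 7x24 grid, ready to feed to a chart.
grid = [[0] * 24 for _ in range(7)]
for dow, hour, n in rows:
    grid[int(dow)][int(hour)] = n
```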

Traffic accidents in Hong Kong in 2009 (breakdown per week)

Traffic accidents in Hong Kong in 2009 (breakdown per day)

Traffic accidents in Hong Kong in 2009 (per time of day and day of week)

The last graphic was inspired by this cool project for visualising the time and day at which tweets are sent from any given Twitter account. I found out later that a horizontal scroller could be used to change the date at which the tweets are observed (the graphic is regenerated automatically by changing the URL of the Google Charts image. Clever!).