Public services shutdown

Unfortunately, we found that we are not set up to support external requests to our search tool and the other tools mentioned in previous posts. If you need data, please contact Dr. King-wa Fu directly:

Cleaning HK Gov Data with Google Refine and displaying it with Google Fusion Tables

Last week, I started working with data from the Buildings Department concerning building permits.

Although the PDF documents are “protected” (copying is disabled when you open them in Acrobat), you can use lesspipe, a common Linux utility and pre-processor for less, which converts many file types into readable text.

Readable does not necessarily mean structured. The lesspipe output is by no means usable as is (it looks like this after separating the sections and aggregating across different PDF files). With the fantastic Google Refine tool, however, you can parse the data as best you can, clean the different fields manually, and even perform geocoding inside the tool (with “Add column by fetching URLs”).
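To make the geocoding step more concrete: “Add column by fetching URLs” builds one request URL per row from the cell value and stores the response in a new column. Here is a minimal Python sketch of the URL-building half only. The endpoint `geocoder.example`, the `region` parameter and the sample addresses are all hypothetical placeholders, not the service or data I actually used.

```python
from urllib.parse import urlencode

# Hypothetical geocoding endpoint, used only for illustration.
GEOCODER_URL = "https://geocoder.example/json"


def geocode_url(address: str, region: str = "hk") -> str:
    """Build the request URL for one cleaned address (one URL per row)."""
    # urlencode escapes spaces, commas and non-ASCII characters safely.
    query = urlencode({"address": address.strip(), "region": region})
    return f"{GEOCODER_URL}?{query}"


# Made-up addresses standing in for a cleaned "address" column.
rows = ["  12 Example Road, Kowloon ", "3 Sample Street, Wan Chai"]
urls = [geocode_url(address) for address in rows]

for url in urls:
    print(url)
```

Trimming whitespace before encoding matters here, because stray spaces left over from the PDF text are a common reason for a geocoder to return nothing.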

After the cleaning was done (it took a few hours last Thursday, and a few more hours today), I did an export in TSV, and sent it to Google Fusion Tables. I customized the map visualisation with the “month” field, and here is the result:
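If you clean the data with a script rather than inside Google Refine, a TSV file of the same shape can be produced with Python's standard `csv` module. This is only a sketch: apart from “month”, the column names and the sample record are invented for illustration.

```python
import csv
import io


def to_tsv(records, fieldnames):
    """Serialise a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up record; only the "month" field is mentioned in the post.
records = [
    {"address": "12 Example Road, Kowloon", "month": "2011-03", "lat": 22.3, "lng": 114.17},
]
tsv_text = to_tsv(records, ["address", "month", "lat", "lng"])
print(tsv_text)
```

Tabs are a convenient delimiter for this data because addresses often contain commas.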

2005-2011 data for “Table 5.2 Buildings for which building authority has issued demolition consent” from Hong Kong Buildings Department’s monthly digests (alpha)

This is not even close to our final product yet: the Google Maps JavaScript API V3 now lets you add layers from Fusion Tables data. In effect, you can build Web applications with different kinds of filters (pull-down menus, etc.) that dynamically change how the data is displayed. The example above shows only the single view specified inside Fusion Tables by the owner of the table (me). You could use the ID of the table (3546150) to make your own visualisation.

For now, the data hasn’t been vetted after refining (maybe the govt will provide us with raw data?), so I recommend treating its validity with great caution. It should be largely correct, but some data points may not have been geocoded properly, if at all. For this particular dataset, corrigenda to the Buildings Department monthly digests are not yet taken into account.

Here is another Google Refine + Google Fusion Tables trick on Hong Kong government data:

Map for data from “Short Term Tenancy (STT) Tender Forecast” from Hong Kong Lands Department (alpha)

This is the Short Term Tenancy (STT) Tender Forecast from the Lands Department. It lists the sites offered on short term tenancy, for a few years, for uses such as car parks. The color code on this custom map is based on the area of each site in square meters (from purple for 0-1000 sqm to red for 5000+).


A year and a half ago, I did a project called Twitterball for one of my information architecture classes at PolyU. It played on the idea of visualising the frequency of tweets over time.

I quickly remashed Twitterball into Weiboball, using data from our archive and the search engine built with WeiboScope Search. The result is Weiboball and Weibobubble.

Get all the reposts and comments of a Sina Weibo post: Introducing a new service from JMSC

We belatedly announce a new service to retrieve all the comments or reposts of any given Sina Weibo post. The service will be very useful for researchers who want to study the contents of chatter surrounding any single post. We created a Google Form to submit posts to the system:

A Weibo ID is a 16-digit numerical identifier. One way to find a post ID is to use one of our tools (WeiboScope and WeiboScope Search both expose the Weibo ID). If you found your post via the website, open the page for that single post (such as this one), view the source code (press Ctrl-U if you are on Google Chrome) and do a Find (Ctrl-F) on “mid=”. The first number starting with a 3 that follows “mid=” is your post ID (in the example we gave, it’s “3433594570011824”).
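If you have many pages to go through, the manual Find can be approximated with a regular expression. The sketch below assumes the page source contains “mid=” followed, possibly after a quote character, by the 16-digit ID. The sample HTML is made up for illustration and real pages may be laid out differently.

```python
import re
from typing import Optional

# "mid=", an optional quote, then a 16-digit number starting with 3.
MID_PATTERN = re.compile(r"mid=[\"']?(3\d{15})(?!\d)")


def find_post_id(page_source: str) -> Optional[str]:
    """Return the first 16-digit ID starting with 3 after 'mid=', or None."""
    match = MID_PATTERN.search(page_source)
    return match.group(1) if match else None


# Invented snippet standing in for the source of a single-post page.
sample_source = '<div mid="123"></div><a href="x?mid=3433594570011824&y=1">post</a>'
post_id = find_post_id(sample_source)
print(post_id)
```

The `(?!\d)` at the end keeps the pattern from matching the first 16 digits of a longer number.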

Once you have your IDs, paste them, one per line, into the form and wait. The program running on our server will start collecting the posts using the Sina Weibo Open API and send you an e-mail confirming that your job was queued. If the job succeeds, you will get a second e-mail telling you that all is OK, along with a link to a zip file with your results in CSV format.
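For the curious, the two ends of such a job can be sketched in a few lines of Python: reading the pasted IDs and packaging the results. This is not the code running on our server. The function names, the CSV columns and the file name `results.csv` are assumptions made for illustration, and the API calls and e-mails in between are left out.

```python
import csv
import io
import zipfile


def parse_ids(form_text):
    """Keep unique, well-formed 16-digit IDs, one per line, in input order."""
    ids = []
    for line in form_text.splitlines():
        candidate = line.strip()
        is_id = len(candidate) == 16 and candidate.isascii() and candidate.isdigit()
        if is_id and candidate not in ids:
            ids.append(candidate)
    return ids


def zip_results(rows, csv_name="results.csv"):
    """Write rows to a CSV file and return a zip archive containing it, as bytes."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    writer.writerow(["post_id", "kind", "text"])  # hypothetical columns
    writer.writerows(rows)
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(csv_name, text.getvalue())
    return archive.getvalue()


pasted = "3433594570011824\n\nnot-an-id\n 3433594570011824 \n"
job_ids = parse_ids(pasted)
# Placeholder row; a real job would fill this in from the API responses.
zip_bytes = zip_results([(job_ids[0], "comment", "placeholder text")])
print(job_ids, len(zip_bytes))
```

Validating the IDs before queueing means a typo is caught right away instead of after a long wait.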

Try it and tell us if it works for you!

(The inspiration for this service came from my time working in bioinformatics in the mid-2000s, when this sort of task, such as finding patterns in DNA or protein sequences, normally took more time than a web user could wait for.)

No trending topics, no problem

Another side effect of the comment system shutdown (as documented by my colleague David at CMP) was the shutdown of the trending topics.

[Edit (12:30PM): There was no shutdown of trending topics (thanks Charlie of Chinageeks), but I noticed that their weekly trending topics never seem (from visual inspection) to include posts made in the past 3 days. I was also confused by the daily topics, because I wrote the entry before 9AM and only saw the topics of 2 days ago as “yesterday’s” trending topics, skipping one entire day. It’s possible that the trending topics are only released once or a handful of times per day.]

The image embedded below is a screenshot of the most commented-on weibos of the last week on Sina. They are ranked by the number of comments made on each post that week, but are all on rather innocuous topics this time.

Popular reposts (熱門轉發) - Sina Weibo: share the new things around you, anytime, anywhere

The following is a screenshot of the page for the most reposted weibos of the last week (original page). Sina counts them over the week ending today, and a calendar lets you navigate through the archive.

Most reposted last week

It’s not always inoffensive stuff, as posts sometimes touch on social injustice and events of political importance, like here. But don’t look for the Bo Xilai and Wang Lijun stuff, because you won’t find any of it. That said, not everything has to be political to be important, and celebrities’ posts often occupy the microblogosphere of the majority.

So it is a good thing that we are keeping our own trending topics.

We have been making our own index of trending topics for more than a year now. While it chiefly depends on our capacity to collect posts, it has always given us a good indication of what’s *really* trending among people of slightly greater influence (we now have a list of 270,000 people, each with more than 1,000 followers).

We look solely at popular posts based on the number of reposts (among a sample), because it’s not practical to build trending topics from comments when you don’t have the capacity to discover popular posts that way.
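The idea can be sketched in a few lines of Python: count how often each original post is reposted by the users in the sample, then keep the top few. This is a simplified illustration and not our production code. The data layout, the function name and the toy values are assumptions; only the 1,000-follower threshold comes from the description above.

```python
from collections import Counter

MIN_FOLLOWERS = 1000  # threshold for users in the sample, as described above


def top_reposted(reposts, followers, n=3):
    """Rank original posts by the number of reposts from sampled users.

    reposts   -- iterable of (user, original_post_id) pairs
    followers -- dict mapping user to follower count
    """
    counts = Counter(
        original_id
        for user, original_id in reposts
        if followers.get(user, 0) > MIN_FOLLOWERS
    )
    return counts.most_common(n)


# Toy data: user "d" falls below the threshold and is ignored.
followers = {"a": 5000, "b": 1500, "c": 2000, "d": 10}
reposts = [("a", "p1"), ("b", "p1"), ("c", "p2"), ("d", "p1")]
trending = top_reposted(reposts, followers)
print(trending)
```

Counting reposts only needs the posts we already collect, which is why it is practical for us while counting comments is not.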

WeiboScope - published by the JMSC at HKU

Our trending topics come in handy when the comments are completely gone, as was the case this week on both Sina and QQ weibo, and reportedly most of the other microblogging platforms. WeiboScope is a visual representation of those trending posts, according to us. For instance, there is a strong representation of pictures of Buddhist monks: one of what seems to be one of their leaders, and another of what looks like two monks deviating from their monastic life (see archived screenshot).

There is also a bit of Yao Ming, cats growing old together, and a ridiculous freak incident where a lady fell into a hole in the pavement where hot water pipes had ruptured.

The image is only indicative, as I didn’t check the actual sampling. But because of its consistency over time (in terms of matching Sina’s own trending topics or what I end up seeing in the news), I can believe that this would be what is interesting among a certain group of slightly more influential people. That’s what the chatter’s on right now on Sina Weibo.

No comment

Since Saturday, comments on Sina Weibo, China’s most popular microblogging platform, have been shut down for clean-up all the way until Wednesday, April 4.

Weibos are a bit more “Twitter-esque” now, since comments are a feature that sets Chinese microblogs apart from their Western counterparts. It is an especially critical feature, since reposts are often made through a comment first (you may then decide to repost something back to your readers), and comments make it possible to aggregate text on the same topic within a single stream.

Tencent had a similar message for their weibo, also one of the most popular ones in the nation:

The networks do not point fingers at a specific target of this “cleanup” in their message to users, but many understand that it is to rid their weibo systems of the chatter on a supposed “coup” in Beijing.

As illustrated in The Economist this week, and as any Chinese speaker understands, 140 Chinese characters carry a lot more meaning than 140 characters in an alphabet-based language.

We’ll try to monitor the blackout and keep you posted on it.