This blog has moved elsewhere!

To make sharing content easier, I’ve decided to move my personal work blog to Tumblr. Thus, The Rice Cooker has now become The Electric Rice Cooker.


Pulling data from Sina Weibo API + the Yihuang self-immolation story

A week or so ago, our director at the JMSC, Prof. Ying Chan, wrote a piece on the ever-changing media landscape in China:

Microblogs, which are limited to 140 characters in length, can be sent from mobile phones or computers. Twitter, the original microblog service, has been blocked in China, but major websites have launched their own Twitter clones, and these have become an important alternative channel for information. It is interesting to note as well that 140 characters in Chinese actually makes for much richer content than the same in English.

(More on the China Media Project website)

To push this a little further, we’ve decided to pull data from Sina Weibo, China’s Twitter equivalent. It may be dubbed a Twitter clone, but like a lot of online Chinese services that I’ve learned to use in the past year living here, it comes with “Chinese characteristics” (other than blatant censorship, mind you), such as blog-like commenting.

One thing that differs from a developer’s point of view: there is no permanent (“immortal”) consumer token, so I can’t pull data through OAuth the way Twitter now requires. Instead, the API still relies on basic authentication: to pull data, I supply a username and password in plain text through a command-line tool like curl. Otherwise, the Sina Weibo API is practically the same as Twitter’s, plus or minus a couple of functions. I’m going to post my data-collection code on our JMSC GitHub when I find some time.
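In the meantime, here is a minimal sketch of what such a pull can look like in Python rather than curl. The api.t.sina.com.cn endpoint and the source parameter carrying the application key are assumptions based on Sina’s API documentation of the time:

import urllib2

def fetch_timeline(username, password, app_key):
	# Sina Weibo identifies the application with a "source" key parameter
	url = "http://api.t.sina.com.cn/statuses/friends_timeline.json?source=%s" % app_key
	# Credentials are sent over plain HTTP basic authentication
	mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
	mgr.add_password(None, url, username, password)
	opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))
	return opener.open(url).read()  # the raw JSON string

print fetch_timeline("YOUR_USERNAME", "YOUR_PASSWORD", "YOUR_APP_KEY")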

(Edit (2010-10-26): As promised, I posted source code for data collection tools on the Sina Weibo network.)

The Sina Weibo API provides much more granular, predictable and decodable location data. For instance, this is my personal info: userinfo-cedricsam.json. My province is “81” and my city is “1”. What does that mean? The answer is in GB 2260 (2007), the Chinese standard for administrative divisions. I’m not sure whether 81 is the code for Hong Kong in GB 2260 (in ISO 3166-2, Hong Kong is actually “CN-91”), but the codes are entirely consistent for locations in Mainland China (Taiwan is “CN-71” in ISO 3166-2). For instance, if you live in Guangzhou City (广州市), your data from the Sina Weibo API comes out as province=44 and city=1, the GB 2260 code for Guangzhou City being 440100…
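In other words, the two fields compose into a GB 2260 code by simple arithmetic; a quick illustration:

def gb2260_code(province, city):
	# GB 2260 codes have six digits: two for the province, two for the
	# prefecture-level city, two for the county (zero at the city level)
	return province * 10000 + city * 100

print gb2260_code(44, 1)  # 440100, Guangzhou City in Guangdong (44)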

The Sina Weibo API also lets you download up to the last 9,999 followers or friends of any given user. So, using the provincial column alone, I was able to geocode onto a Google Chart map (cht=map), with nothing but ISO 3166-2 / GB 2260 provincial codes, all the followers of Ms. Zhong Rujiu (钟如九), one of the people involved in the Yihuang (宜黄) self-immolation, a protest against the forced requisition of their land. (It’s worth noting that this story has not been censored, as far as I know.)
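The per-province tally behind such a chart takes only a few lines. A hedged sketch, assuming pages of follower data have been saved locally as JSON (the file names are hypothetical, and the response is assumed to be either a list of users or an object wrapping a users array):

import json
from collections import defaultdict

counts = defaultdict(int)
for page in ("followers-page1.json", "followers-page2.json"):  # hypothetical dumps
	data = json.load(open(page))
	users = data if isinstance(data, list) else data.get("users", [])
	for u in users:
		prov = u.get("province") or 0
		counts[int(prov)] += 1

# Provincial codes double as ISO 3166-2 subdivisions: 44 -> CN-44 (Guangdong)
for code, n in sorted(counts.items(), key=lambda kv: -kv[1]):
	print "CN-%02d: %d" % (code, n)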

The result is this map:

Sina Weibo – Zhong Rujiu’s last 9999 followers geolocation
URL: http://jmsc.no-ip.org/social/sinaweibo.py/chart?user_id=1819775930

It is not very interesting, notably because of the 9,999 limitation. The API does not indicate when each follow relationship began either, though the ordering is implied in the friends/followers data you receive (I could have stored that ordering, but chose not to at this point).

By a large margin, most users lived in Guangdong (~1400). Second came residents of Beijing (~900). Why? Is it just because of market penetration by Sina in Guangdong?

You might notice that the chart URL points to a Python script. It in fact acts as a proxy and wrapper for the Google Chart API, since I create the chart from data pulled into my local database rather than from real-time data.

So, from here, we know that we have interesting tools and that the Sina Weibo API is in fact as open and accessible as the Twitter API. Because it still uses basic authentication (for the moment, we assume), the programming is also a lot easier to deal with.


China Media Project on Twitter: @cmphku followers around the globe

I uploaded compiled data on the followers of @cmphku to Google Fusion Tables (only the aggregate is shown here, for privacy reasons), and here is the result. The location data is based on the time_zone field returned by the Twitter API’s users/show method.
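For reference, a minimal sketch of how such an aggregate could be compiled, assuming the then-current unauthenticated v1 REST endpoint and a list of follower ids already obtained (the ids below are hypothetical placeholders):

import csv
import json
import urllib2
from collections import defaultdict

def time_zone(user_id):
	# users/show returns the user's profile, including the time_zone field
	url = "http://api.twitter.com/1/users/show.json?user_id=%d" % user_id
	return json.load(urllib2.urlopen(url)).get("time_zone")

counts = defaultdict(int)
for uid in [12345, 67890]:  # hypothetical ids from a followers/ids call
	counts[time_zone(uid) or "unknown"] += 1

# Write a CSV that Fusion Tables can import directly
out = csv.writer(open("followers_by_timezone.csv", "wb"))
out.writerow(["time_zone", "followers"])
for tz, n in counts.items():
	out.writerow([unicode(tz).encode("utf-8"), n])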


China Media Map on Google Fusion Tables

http://tables.googlelabs.com/DataSource?snapid=68215

I just discovered Google Fusion Tables. Ten minutes later, I had imported the China Media Map data and produced this map. Info windows can even be customized by the user!


Feeding WordPress articles to a Google Map

Soon after the release of the China Media Map last week, we implemented a new feature on the map: geo-tagged events. Unfortunately, there isn’t yet a content management system running behind the map, just a plain database that isn’t easy for our editors to update.

What we ended up using to feed the map with points of interest was the same WordPress blog that powers the CMP site. I used two third-party WordPress plugins, wp-json and WP Geo.

WP Geo was chosen for its user-friendly interface, which my colleague David Bandurski uses to geotag articles; the latitude, longitude, zoom level, etc. are stored as numbers (rather than in a true spatial representation) in specific WordPress meta tables. wp-json in turn extends the feed system by providing a feed of the latest entries. I modified the script a little (my version of wp-json.php) so that it also outputs the fields stored by WP Geo, and so that it can feed by the category specified in the URL. (You should change the “~” that we use in place of the default “category” string indicating category archives.) For better management of articles made specifically for the map, I created special categories starting with “geo-”, whose articles are not displayed on the regular website. Here’s an example of the geo-events JSON feed.
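To give an idea of how that feed gets consumed server-side, here is a hedged sketch in Python; the wpgeo_* field names are hypothetical, since they depend on exactly how wp-json.php was modified to expose the WP Geo meta values:

import json
import urllib2

feed = json.load(urllib2.urlopen("http://cmp.hku.hk/~/geo-events/feed/json/"))
posts = feed if isinstance(feed, list) else feed.get("posts", [])
for post in posts:
	# wpgeo_* are hypothetical names for the WP Geo values added to the feed
	print post.get("title"), post.get("wpgeo_latitude"), post.get("wpgeo_longitude")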

Screenshot-China Media Map - Events

From the map, you asynchronously fetch the JSON file from cmp.hku.hk. But since the map lives on a different server and domain name, you necessarily run into the client-side cross-domain security problem. The solution? Use a proxy script. I wrote one in Python (proxy.py) that goes as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# mod_python publisher handler: proxies the wp-json feeds from cmp.hku.hk
# so that the map's JavaScript can fetch them from its own domain.
import pycurl
import StringIO

def cmp_jsonfeed(req, p):
	# "p" is sandwiched between a fixed prefix and suffix, so this script
	# can only fetch wp-json feeds from cmp.hku.hk, nothing else
	prefix = "http://cmp.hku.hk/~/"
	suffix = "/feed/json/"
	url = prefix + p + suffix
	b = StringIO.StringIO()
	c = pycurl.Curl()
	c.setopt(pycurl.URL, url)
	c.setopt(pycurl.WRITEFUNCTION, b.write)
	c.setopt(pycurl.FOLLOWLOCATION, 1)  # follow WordPress redirects
	c.setopt(pycurl.MAXREDIRS, 5)
	c.perform()
	return b.getvalue()

I did my best to be secure and not let just anyone use my script to proxy arbitrary requests: argument “p” sticks itself between the specified prefix and suffix. The following link fetches the geo-events JSON feed and serves it from the same domain you’re making the asynchronous JavaScript call from, thereby bypassing the cross-domain problem:

http://jmsc.no-ip.org/maps/cmp/proxy.py/cmp_jsonfeed?p=geo-events

The rest is just a matter of parsing the JSON feed in JavaScript and creating the markers and the other page elements to fill with the data.


Python script for automated geocoding with Google’s Geocoding Web Service

I have a bunch of known addresses or locations stored as strings (varchars) in my database, and I want to store them as points in a spatial database. I also don’t want to query the Google Maps API every time I need to place them on a map. What do I do? I geocode them using the Google Maps API Geocoding Service, and more specifically the Web service directly accessible through a simple URL. A demonstration with two of the supported output formats (JSON and CSV):

http://maps.google.com/maps/geo?q=China&output=json&sensor=false

http://maps.google.com/maps/geo?q=China&output=csv&sensor=false

Note: my script uses the Geocoding V2 Web Service, but could be easily adapted to take advantage of V3’s extra features.

Here’s my geocode.py script:

import sys
import string
import httplib
import urllib
import pg

if len(sys.argv) > 1:
	# Join all command-line arguments into one lowercased query string
	q = ""
	for i in range(1, len(sys.argv)):
		q += " " + string.lower(sys.argv[i])
	q = string.strip(q)
	pgconn = pg.connect('YOUR_DB', '127.0.0.1', 5432, None, None, 'YOUR_DBUSER', 'YOUR_PASSWORD')
	key = 'YOUR_GOOGLE_KEY'
	# Bias results towards China with the region and bounds parameters
	china_bounds = '18.0,73.45|53.55,134.8'
	values = {'q': q, 'key': key, 'sensor': 'false', 'output': 'csv', 'region': 'cn', 'bounds': china_bounds}
	data = urllib.urlencode(values)
	conn = httplib.HTTPConnection("maps.google.com")
	try:
		conn.request("GET", "/maps/geo?" + data)
	except Exception:
		sys.exit(str(sys.exc_info()))
	r = conn.getresponse()
	if r.status == 200:
		# The V2 CSV output has four fields: status, accuracy, latitude, longitude
		a = r.read().split(',')
		acc = a[1]
		lat = a[2]
		lng = a[3]
		print '%(status)s: %(latlng)s' % {'status': r.status, 'latlng': lng + ',' + lat}
		if lat == "0" and lng == "0":
			sys.exit('Location not found for "%s"' % q)
		wkt = "POINT(" + lng + " " + lat + ")"
		# Update the row if this query was geocoded before, insert otherwise
		# (values are interpolated unescaped: fine only for trusted input)
		sel = pgconn.query("SELECT * FROM google_geocoding WHERE query = '%(q)s' " % {'q': q})
		res = sel.dictresult()
		if sel.ntuples() > 0:
			pgconn.query("UPDATE google_geocoding SET point = ST_GeomFromText('%(wkt)s',4326), fetchdate = NOW() WHERE query = '%(q)s' " % {'q': q, 'wkt': wkt})
			print "UPDATE: writing over last fetchdate of " + res[0]['fetchdate']
		else:
			pgconn.query("INSERT INTO google_geocoding (query, fetchdate, point) VALUES ('%(q)s', NOW(), ST_GeomFromText('%(wkt)s',4326)) " % {'q': q, 'wkt': wkt})
			print "INSERT: a new row was added"
	else:
		print 'error: %s' % r.status
else:
	sys.exit("usage: geocode.py [query]")

This Python script takes a single query (potentially with spaces) as its arguments and sends it to the Google Geocoding Web Service. It gets back the result, parses it, and puts it in a database table called google_geocoding, which I use later. The table has four columns: a unique id, the query string, the point geometry column, and the timestamp of its last update (I made the table unique on the query string).

In practical use, on the command line, I read the strings to query from a file, line by line, and send each one to the Python script. Here’s an improvised loop (reading whole lines, since queries may contain spaces):

while read line
	do echo $line
	python ./geocode.py $line
done < text_file_with_strings_to_query.txt

Using MapServer to feed tiles to Google Maps

While working on the CMP‘s China Media Map, I had the chance to use MapServer, an Open Source platform for publishing spatial data and interactive mapping applications on the Web.

We start with a need: displaying big datasets of geotagged data efficiently on the Web. We have over 4,000 data points corresponding to individual news agencies in China. Almost every one of these points (each containing an address) was geocoded using Google’s geocoding service (the method shown in another entry). All these points are modeled in a PostGIS-enabled PostgreSQL database, and we use Google Maps as our display platform. How can we display the points?

One of the early versions, which I knew wouldn’t work in a production environment, generated every point as a marker. Of course, at the lowest zoom level, country level (China, here), 4,000 marker objects risk crashing any more or less recent and decent computer. The alternative is to give users a preview of those 4,000+ points without creating the marker objects. How? Check out this video first:

This was shown at Google I/O 2009; the part of interest starts at about 26:31 into the video, where the presenter talks about tiny markers. How does Google do it? They don’t actually go into details in the video.

MapServer is the tool you need. One of its many uses is helping you create your own custom layer. As you might know, Google Maps works by feeding you 256×256px squares of map data. These tiles are generated by Google’s in-house map tile servers, which serve URLs like the following:

http://mt0.google.com/vt/lyrs=m@124&hl=en&x=836&y=446&z=10&s=Ga

But it is possible to produce your own tiles, with data coming from your own spatial database. Google has an outline of how to integrate these easily with version 3 of the Google Maps API (still in Labs).
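For context, here is how geographic coordinates map onto those x/y/z tile indices. This is a sketch of the standard spherical-Mercator tile math, with the output formatted for the tile=x+y+z convention of mapserv’s tilemode=gmap (shown further below):

import math

def latlng_to_tile(lat, lng, zoom):
	# The world is 2^zoom tiles wide and tall in spherical Mercator
	n = 2 ** zoom
	x = int((lng + 180.0) / 360.0 * n)
	lat_rad = math.radians(lat)
	y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
	return x, y

x, y = latlng_to_tile(39.9, 116.4, 10)  # Beijing at zoom level 10
print "tile=%d+%d+%d" % (x, y, 10)     # slots into the mapserv URL below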

You may use MapServer, which has binaries available for the major platforms (Windows, Mac, Linux). I use Linux, the Ubuntu distro, so all I had to do was set up the UbuntuGIS Personal Package Archive (PPA) in Unstable to get the latest versions of all the GIS software needed.

Use a configuration file such as the one I have: news.map

MAP
	DEBUG 5
	CONFIG "MS_ERRORFILE" "/home/csam/geo/mapserver/error.log"
	FONTSET "/home/csam/geo/mapserver/fontset.txt"
	IMAGETYPE PNG
	#EXTENT 105 15 132 54
	EXTENT -40.900557 -99.293024 68.194131 174.885971
	#EXTENT -141.018073 41.676949 -52.582296 89.999427
	SIZE 256 256
	IMAGECOLOR 255 255 255
	TRANSPARENT ON

	OUTPUTFORMAT
		NAME png
		DRIVER "GD/PNG"
		MIMETYPE "image/png"
		IMAGEMODE RGB
		EXTENSION "png"
		FORMATOPTION "INTERLACE=OFF"
	END

	PROJECTION
		"init=epsg:4326"
	END

	WEB
		METADATA
			WMS_SRS "EPSG:4326 EPSG:900913"
		END
	END

	# Overview layer: plain small red dots
	LAYER
		NAME news_overview
		CONNECTIONTYPE POSTGIS
		CONNECTION "YOUR_OWN_CONNECTION_STRING"
		DATA "point FROM (SELECT g.gid, g.point, na.name AS placename FROM google_geocoding AS g LEFT JOIN news_agencies AS na ON g.query = lower(na.address) WHERE na.new_id IS NOT NULL UNION SELECT g.gid, g.point, na.name AS placename FROM google_geocoding AS g LEFT JOIN news_agencies AS na ON g.query = lower(na.name) WHERE na.address IS NULL AND na.new_id IS NOT NULL) AS subquery USING UNIQUE gid USING srid=4326 "
		STATUS ON
		TYPE POINT
		CLASS
			NAME "Points"
			STYLE
				SIZE 3
				COLOR 255 0 0
			END
		END
	END

	# The same points drawn with the "pimple" pixmap symbol
	LAYER
		NAME news
		CONNECTIONTYPE POSTGIS
		CONNECTION "YOUR_OWN_CONNECTION_STRING"
		DATA "point FROM (SELECT g.gid, g.point, na.name AS placename FROM google_geocoding AS g LEFT JOIN news_agencies AS na ON g.query = lower(na.address) WHERE na.new_id IS NOT NULL UNION SELECT g.gid, g.point, na.name AS placename FROM google_geocoding AS g LEFT JOIN news_agencies AS na ON g.query = lower(na.name) WHERE na.address IS NULL AND na.new_id IS NOT NULL) AS subquery USING UNIQUE gid USING srid=4326 "
		STATUS ON
		TYPE ANNOTATION
		CLASS
			NAME "pimple"
			STYLE
				SYMBOL "pimple"
				SIZE 10
				COLOR 0 0 0
			END
			LABEL
				PARTIALS FALSE
				FORCE TRUE
			END
			TEXT ' '
		END
	END

	SYMBOL
		NAME 'pimple'
		TYPE PIXMAP
		IMAGE "google-pimple.png"
		TRANSPARENT 0
	END

	# Labels each point with its WKT coordinates
	LAYER
		NAME labels
		CONNECTIONTYPE POSTGIS
		CONNECTION "YOUR_OWN_CONNECTION_STRING"
		DATA "point FROM (SELECT g.gid, g.point, ST_AsText(g.point) AS labl FROM google_geocoding AS g LEFT JOIN news_agencies AS na ON g.query = lower(na.address) WHERE na.new_id IS NOT NULL ) AS subquery USING UNIQUE gid "
		STATUS ON
		TYPE ANNOTATION
		MAXSCALE 100000000000
		LABELITEM labl
		CLASS
			LABEL
				ANGLE auto
				SIZE 10
				COLOR 192 0 0
				TYPE truetype
				FONT arial
			END
		END
		METADATA
			WMS_SRS "EPSG:4326 EPSG:900913"
		END
	END
END

Point to the mapserv CGI with the appropriate arguments: map is the path to your mapfile, mode is tile for this purpose, and tilemode is gmap for Google Maps (it could serve Bing Maps and OpenStreetMap too).

This is a link to a tile on the China Media Map:
http://jmsc.no-ip.org/cgi-bin/mapserv?map=/home/csam/geo/mapserver/news.map&layers=news&mode=tile&tilemode=gmap&tile=3+1+2

Both the MapServer and PostGIS documentation sites give you examples of how to write the configuration file. The open source geospatial consortium also has a page showing you what PostGIS is, and how to use MapServer with it on the Web.


China’s provincial boundaries on CMP Map

Screenshot-China Media Map - Google Chrome

I got the provincial boundaries from Harvard’s CHGIS (China Historical GIS) website. After a free registration, you can go to the datasets section and download shapefiles that include the 1997 China provincial boundaries (after the creation of the Chongqing municipality). I imported them with shp2pgsql, and then, from my Postgres database, output GeoJSON for use on my test webpage for the upcoming CMP Media Map. The map runs on the new V3 version of the Google Maps API. The coordinates fed from the database are used to create the Polygon objects representing the provincial land boundaries.
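A minimal sketch of that database-to-GeoJSON step, using the same pg module as elsewhere on this blog; the table and column names here are assumptions standing in for whatever shp2pgsql produced from the CHGIS shapefile:

import pg

pgconn = pg.connect('YOUR_DB', '127.0.0.1', 5432, None, None, 'YOUR_DBUSER', 'YOUR_PASSWORD')
# ST_AsGeoJSON serializes each PostGIS geometry as a GeoJSON fragment
sql = "SELECT name_py, ST_AsGeoJSON(the_geom) AS geojson FROM china_provinces_1997"
for row in pgconn.query(sql).dictresult():
	print row['name_py'], row['geojson'][:60], "..."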


Importing Hong Kong boundaries shapefiles to a database

For current and, mostly, future projects, I needed digital geographic boundaries of Hong Kong, per district and per the lower delineations used by Hong Kong’s census bureau.

First of all, I checked out the Hong Kong 2001 Population Census MAP CD-ROM from our HKU Library, through a special arrangement. The CD can be bought from the census bureau for HK$840, but I am not sure about its use license. As far as we know, there is no geographic data for the 2006 census, unless it stayed the same (I didn’t verify this).

Inside the CD-ROM, you find a bunch of files which I think you could install (I didn’t need that, so I didn’t try to find out). What’s useful to us is contained in two folders, one called SHP and the other E00. Both contain regular flat files in two of the most popular digital cartography formats. In fact, you find the following in the Readme:

(i) Coastline (COAST)
(ii) District Council District Boundary (DC)
(iii) District Council District (Land Area) Boundary (DC_LAND)
(iv) Constituency Area Boundary (DCCA)
(v) Large Tertiary Planning Unit Group Boundary (TPU_LARGE)
(vi) Small Tertiary Planning Unit Group Boundary (TPU_SMALL)
(vii) Tertiary Planning Unit Boundary (TPU)
(viii) Small Street Block Group Boundary (TPUSB_SMALL)
(ix) New Town Boundary (DIST_NT)

You get files for these boundaries in both formats. I am going to use SHP, the Shapefile, ESRI’s format containing the primary geographic reference data; E00 is ESRI’s older ArcInfo interchange format. We don’t have ArcGIS or any ESRI products in our lab, so we will use a PostgreSQL database with PostGIS, everyone’s favourite geographic database extension.

I use a utility command called shp2pgsql to turn a shapefile into tables with a geometry column holding the data. There are also “shp2…” converters for MySQL and for exporting as an image.

I follow the PostGIS documentation, and here’s a slightly modified command that I use (not all options are necessary):


# shp2pgsql -c -I -W big5-hkscs dc_land.shp dc_land > dc_land.sql
# psql -d jmsc -f dc_land.sql

The -W option is required if you want to keep the Chinese characters from the attribute columns, which in this case, with the district land boundaries, gives us the district names in Chinese. The right encoding to use is big5-hkscs, which contains an extra set of Chinese characters used only in Cantonese. I first used big5, and it gave me an error when processing the Sham Shui Po row (because 埗, for “Po”, is actually a Cantonese-only character).
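You can reproduce the problem in a couple of lines of Python, which ships with both codecs; a quick illustration:

po = u'\u57d7'  # the character 埗, as in Sham Shui Po (深水埗)
try:
	po.encode('big5')
except UnicodeEncodeError:
	print "not representable in plain Big5"
print repr(po.encode('big5hkscs'))  # works with the HKSCS codec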

Once the data is in the database, you can mash it with other data (like census data) and export it for visualization apps, for example as KML (using libkml) for Google Earth.


Experiments with libkml

My first geographic database is long gone, and so are the scripts that I wrote to fish the data out of it.

When I started playing with GIS, I initially used a simple MySQL database without any geographic extension. In fact, I went on to store my points as flat text; needless to say, the data (at least the geographic part of it) was not very useful after the project was finished.

Similarly, the scripts that I used to mash up the geographic data (electoral districts of Canada) and the functional data (electoral results) in order to generate KML files were written in custom PHP, using its XML-manipulation functions (KML being an XML-based language).

This was before I discovered libkml.

libkml is a library for creating and parsing KML content. It was first introduced in March 2008 by Google as an open-source library for reading, writing and manipulating KML. It is written in C++, but thanks to SWIG, you can also use libkml from Java and Python.

Map of all media bodies (tv, periodicals, newspapers) in China

I use libkml in Python, but so far only for generating KML. The image above shows an example of how a KML file I created renders in Google Earth: points of interest representing media outlets in China, for a yet-to-be-released visualization project with my China Media Project colleagues. The idea is this: you have a geographic database with points of interest and their linked data; in Python, you fetch from the database and use libkml to output a KML file.
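A minimal sketch of that pattern, modeled on libkml’s own hello-world samples and assuming the SWIG bindings (the kmldom module) are installed; the sample row is hypothetical, standing in for rows fetched from the database:

import kmldom

factory = kmldom.KmlFactory.GetFactory()
document = factory.CreateDocument()

for name, lat, lng in [(u"Example outlet, Guangzhou", 23.13, 113.26)]:  # hypothetical row
	coordinates = factory.CreateCoordinates()
	coordinates.add_latlng(lat, lng)
	point = factory.CreatePoint()
	point.set_coordinates(coordinates)
	placemark = factory.CreatePlacemark()
	placemark.set_name(name.encode("utf-8"))
	placemark.set_geometry(point)
	document.add_feature(placemark)

kml = factory.CreateKml()
kml.set_feature(document)
print kmldom.SerializePretty(kml)  # the finished KML document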

The main advantage of libkml is that it simplifies my code greatly: I don’t need to reinvent the wheel, and the library is tailor-made for generating KML.

The trouble is not so much the actual rendering or visualization anymore. The challenge, I find, is in finding quality data, especially of the geographical kind. But that will be for another entry.