This blog has moved elsewhere!

To make sharing content easier, I’ve decided to move my personal work blog to Tumblr. The Rice Cooker has thus become The Electric Rice Cooker.


New attributes in Sina Weibo API’s V2

Sina released a second version of their API about a month ago. It’s worth mentioning that we have moved some of our crawling scripts over to V2.

Of particular interest to us, the status entity now has the following new attributes:

  • reposts_count
  • comments_count
  • mlevel

The first two are self-explanatory; the third probably means “maturity level”. We’re happy to get the first two and think it was about time Sina started giving us exact numbers. In their defense, repost (and comment) numbers on Sina Weibo are much, much higher than on Twitter, because status entities are much better preserved on Sina (on Twitter, those attention hoggers just keep re-writing posts to include their names). On the user entity side, the new attributes are:

  • allow_all_comment
  • avatar_large
  • verified_reason
  • bi_followers_count
  • verified_type
  • lang

It should be noted that the last two, verified_type and lang, are not yet documented; I only saw them this afternoon (and promptly reflected them in our scripts). They are otherwise self-explanatory. Unlike on Twitter, Sina verification comes in several levels. My Weibo account is “verified”, but only because I was verified as a JMSC employee (not because I am famous, bah). Corporate accounts get a different kind of verification, and there are probably other kinds I’m not sure about (power users?). “lang” is very interesting: there are already mobile clients in English, and Web interfaces in Traditional Chinese (for Taiwan) and Simplified Chinese (for the mainland). So, is Sina really preparing international versions?
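Here is a minimal sketch of how our crawler now picks up these new fields. It assumes you already have a V2 status decoded from JSON into a Python dict (e.g. one element of the statuses/user_timeline output); the field names are the ones listed above, everything else is just illustrative.

def summarize_status(status):
	# The embedded user object carries the new user-level attributes.
	user = status.get('user', {})
	return {
		'id': status.get('id'),
		'reposts_count': status.get('reposts_count', 0),
		'comments_count': status.get('comments_count', 0),
		'mlevel': status.get('mlevel'),  # presumably "maturity level"
		'verified_type': user.get('verified_type'),
		'lang': user.get('lang'),
		'bi_followers_count': user.get('bi_followers_count'),
	}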

***

After several e-mails from readers, we should acknowledge that we also faced some auth problems, but we were lucky to have started the project early, such that we don’t face some of the other problems (such as the need to specify in the developer app whether we are foreigners). I don’t know if it has anything to do with the rate limits you end up getting.

We’ve also had a few problems with the friendship (friends / followers) functions in V1, and those are still there in V2. Namely, the calls won’t work with just OAuth: you also need cookies (in other words, a Web browser accessing the API URL while signed in to your Sina account). If you see some inconsistencies, feel free to e-mail us.

We’re submitting our social media project’s final report this week, so expect us to release the tools we developed into the wild in the coming weeks (not months, I hope). Some of them are already up on our GitHub.


Getting data from the Facebook Graph API + a script

[Screenshot: Facebook Developers page in Google Chrome]

After the last few weeks spent tweaking the tools for data-grabbing on Sina Weibo and Twitter, we’re now moving on to Facebook. Or rather going back to it, as this is what I was working on before focusing on the microblogs.

Python script to get data from the Facebook Graph API:
https://github.com/JMSCHKU/Social/blob/master/facebook.graph.py

Facebook has a very interesting API, which it dubs the Graph API. It basically models all entities of Facebook, be they users (people), events, pages or groups, but also things like your links or even Facebook “mail” messages, as nodes in Mark Zuckerberg’s wacky vision of organising the world’s social information as points and arrows, or rather vertices and edges in graph theory. It’s people and things and their connections, sitting in a database and rendered as the interface of Facebook.com, on your mobile device, or who knows where eventually (a sky map of Facebook, anyone?).

Basically, just as with Twitter’s and Sina Weibo’s open APIs, you can use Facebook’s API to access the same information you would see as a normal user and retrieve it to your local storage for future use, for comparisons over time or other analyses that are more practical to do on your local server. Only this time you can automate the process without resorting to web crawlers, which have to be tailor-made for each site; the API serves everything in a single standard format, JSON.

Without even logging in, you can already get information out of searches on the Graph API (see Openbook’s experiment). For instance, a search on Hong Kong gives you this: http://graph.facebook.com/search?q=Hong%20Kong. If you know my Facebook username, you can point the Graph API at it and get my basic info: full name, username, locale, gender (http://graph.facebook.com/cedricsam). None of this is very special: it’s just the same data you could get from a Google search without logging in to Facebook.
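To give an idea of how little code this takes, here is a minimal sketch that fetches the two example URLs above and prints a few fields (assuming Python 2.6+ for the json module; the exact fields you get back depend on the type of each object):

import urllib2, json

# Unauthenticated search on the Graph API.
search = json.loads(urllib2.urlopen(
	'http://graph.facebook.com/search?q=Hong%20Kong').read())
for item in search.get('data', []):
	print item.get('id'), item.get('name') or item.get('message', '')[:60]

# Public profile info by username.
me = json.loads(urllib2.urlopen('http://graph.facebook.com/cedricsam').read())
print me['name'], me.get('username'), me.get('locale'), me.get('gender')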

It becomes powerful when you get an access token (associated with a dummy app that you presumably created) after following the instructions in the documentation; the token lets you navigate the same data as a logged-in user. That’s where it gets interesting, because you can then pull data such as group membership (how about co-membership?) or, say, the popularity of a link shared across Facebook by different users. That’s very interesting for social network analysis.
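In practice the token just gets appended as a query parameter. Here is a hedged sketch: ACCESS_TOKEN is the token from your dummy app, and /me/groups is just one example of a connection you could walk (substitute whichever one you are after):

import urllib2, json

ACCESS_TOKEN = 'PASTE_YOUR_TOKEN_HERE'

# Same pattern as the unauthenticated calls, plus the access_token parameter.
url = 'https://graph.facebook.com/me/groups?access_token=' + ACCESS_TOKEN
groups = json.loads(urllib2.urlopen(url).read())
for g in groups.get('data', []):
	print g.get('id'), g.get('name')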

In any case, I started writing a Python script to handle the different use cases of the Graph API and either output to CSV or store in a PostgreSQL database. I am posting a first, almost-vanilla version of it on our GitHub:
https://github.com/JMSCHKU/Social/blob/master/facebook.graph.py


Sina Weibo data-grabbing tools for Linux

I’ve written some quick tools for grabbing basic data from Sina Weibo’s API, clearly the most popular of the “Chinese Twitters”. You can get them here:

JMSCHKU Social on GitHub.com

Using these same tools, I managed to produce this mini-survey of Zhong Riqin’s followers across China in less than an hour.

In this version of the tools, you can get the latest statuses (limited to 200 by Sina), user info, and friends and followers (both limited to the most recent 9999, namely entries #0-5000 and #4999-9999).

For those who are not familiar with Sina Weibo, it is quickly evolving past the stage of simply being a Twitter clone, with interface innovations such as the ability to make blog-style comments. Another cool thing about Sina Weibo? Its commitment to being open.

The API is very similar to Twitter’s, aside from a few Sina idiosyncrasies. Also, Sina Weibo still provides basic authentication on top of OAuth. In plain language, that means you can pull data with just a username and password, rather than tokens that need to be generated, etc. That’s why I can afford to have sinagetter.sh, a Bash shell script.
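That simplicity looks like this in Python (a minimal sketch: the endpoint path and the source app-key parameter are from the V1 API as I remember using it, so treat them as assumptions and adjust them to whatever your own scripts call):

import urllib2, json

USER, PASSWORD = 'you@example.com', 'secret'  # your Sina credentials
APP_KEY = 'YOUR_APP_KEY'                      # the app "source" key

url = ('http://api.t.sina.com.cn/statuses/user_timeline.json'
	'?source=%s&count=200' % APP_KEY)

# Plain HTTP basic auth: no token dance needed.
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, USER, PASSWORD)
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passman))

statuses = json.loads(opener.open(url).read())
print len(statuses), 'statuses fetched'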

The repository also has my Twitter-related code, but most of it hasn’t been updated for a while. I just made a new version of twitter.users.py, a script that grabs user info by unique Twitter ID.


We are now on GitHub!

Naturally, to share code, there is no equal to GitHub, a Web-based hosting service for code using the popular Git revision control system. In plain language, it means that the code I am writing, mostly for the JMSC nowadays, will be available on GitHub.

To start things off, I put up some of the Python scripts that I wrote for our online social media research project: http://github.com/JMSCHKU/Social/

For those who are unfamiliar with GitHub, it has become ubiquitous: I run into it whenever I need source code (mostly for compiling into usable programs). I assume that many people already know of SourceForge, the open-source code repository started at the turn of the century. GitHub innovates over SF.net by decentralizing version control.

For starters,


Python code to get tweets through Twitter API

I’m just posting the current version of a series of Python scripts that I’ve been using for the past week or so to fetch data through the Twitter API using OAuth. Since the beginning of September, Twitter has required developers to authenticate through OAuth to access certain of its functions, including the very useful statuses/user_timeline. This function gets you the tweets (statuses) from a user’s timeline, up to 3200 of them, in pages of 200.

I particularly tuned the script so that it doesn’t run you over the limit of 350 authenticated requests to the API and make you miss some tweets. You can adjust the sleep times if you want, but it’s been pretty reliable for me so far: I’m making up to 16 calls for each of the 1000 users on my list. Sleep times were also set so as not to bombard Twitter, and a failed request is retried up to a certain number of times (changeable in the code) before the program stops, writes a complaint to the CSV output and exits (this applies particularly to options 4 and 5, which I spent more time developing).
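The pacing logic boils down to something like the simplified sketch below (this is not the actual twitter.oauth.py code; fetch_page stands in for whatever OAuth-signed GET you are using, and the sleep values are just the kind of numbers I mean):

import time

MAX_RETRIES = 3
SLEEP_BETWEEN_CALLS = 11    # seconds; keeps you safely under 350 calls/hour
SLEEP_AFTER_FAILURE = 60

def get_timeline(fetch_page, screen_name):
	tweets = []
	for page in range(1, 17):  # 16 pages x 200 = 3200 tweets max
		for attempt in range(MAX_RETRIES):
			try:
				batch = fetch_page(screen_name, page=page, count=200)
				break
			except Exception:
				time.sleep(SLEEP_AFTER_FAILURE)
		else:
			raise RuntimeError('giving up on %s, page %d' % (screen_name, page))
		if not batch:
			return tweets  # no more tweets for this user
		tweets.extend(batch)
		time.sleep(SLEEP_BETWEEN_CALLS)
	return tweets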

Here’s the link. This function is “option 5” in my scripts bundled up in twitter.oauth.py.

Option 4 does the same thing but by user_id, which may not be as easy to find (you have to look in the Atom feed URL). You need to replace some values in the script, like the app id and tokens, etc. Option 3 fetches user info by screen name and stores it in a DB (options 4 and 5 only write to a CSV).

This script works on UNIX. It hasn’t been tested on Windows, where it probably won’t work without some minor customizations for file I/O.


Python script for automated geocoding with Google’s Geocoding Web Service

I have a bunch of known addresses or locations stored as strings (varchars) in my database, and I want to store them as points in a spatial database. I also don’t want to query the Google Maps API every time I need to place them on a map. What do I do? I geocode them using the Google Maps API Geocoding Service, and more specifically the Web service directly accessible through a simple URL. Here is a demonstration with two of the supported output formats (JSON and CSV):

http://maps.google.com/maps/geo?q=China&output=json&sensor=false

http://maps.google.com/maps/geo?q=China&output=csv&sensor=false

Note: my script uses the Geocoding V2 Web Service, but could be easily adapted to take advantage of V3’s extra features.

Here’s my geocode.py script:

# geocode.py: send a place name given on the command line to the Google
# Geocoding (V2) web service and insert/update the resulting point in the
# google_geocoding PostGIS table. Replace the YOUR_* placeholders below.
import sys
import pg
import httplib
import time
import string
import urllib
import urllib2

if len(sys.argv) > 1:
	q = ""
	for i in range(1,len(sys.argv)):
		q += " " + string.lower(sys.argv[i])
	q = string.strip(q)
	pgconn = pg.connect('YOUR_DB', '127.0.0.1', 5432, None, None, 'YOUR_DBUSER', 'YOUR_PASSWORD')
	key = 'YOUR_GOOGLE_KEY'
	#path = "/maps/geo?q=%(q)s&gl=ca&sensor=false&output=csv&key=%(key)s" % {'q' : urllib.urlencode(q), 'key' : key}
	china_bounds = '18.0,73.45|53.55,134.8'
	values = {'q' : q, 'key' : key, 'sensor' : 'false', 'output' : 'csv', 'region' : 'cn', 'bounds' : china_bounds}
	data = urllib.urlencode(values)
	headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}
	conn = httplib.HTTPConnection("maps.google.com")
	try:
		conn.request("GET", "/maps/geo?" + data)
		#print data
	except (Exception):
		sys.exit(sys.exc_info())
	r = conn.getresponse()
	if r.status == 200:
		a = r.read().split(',')
		acc = a[1]
		lat = a[2]
		lng = a[3]
		print '%(status)s: %(latlng)s' % {'status' : r.status, 'latlng' : lng + ',' + lat}
		if lat == "0" and lng == "0":
			sys.exit("Location not found for "%s" " % q)
		wkt = "POINT(" + lng + " " + lat + ")"
		sel = pgconn.query("SELECT * FROM google_geocoding WHERE query = '%(q)s' " % { 'q': q, 'wkt': wkt})
		res = sel.dictresult()
		if sel.ntuples() > 0:
			pgconn.query("UPDATE google_geocoding SET point = ST_GeomFromText('%(wkt)s',4326), fetchdate = NOW() WHERE query = '%(q)s' " % { 'q': q, 'wkt': wkt})
			print "UPDATE: writing over last fetchdate of " + res[0]['fetchdate']
		else:
			pgconn.query("INSERT INTO google_geocoding (query, fetchdate, point) VALUES ('%(q)s', NOW(), ST_GeomFromText('%(wkt)s',4326)) " % { 'q': q, 'wkt': wkt})
			print "INSERT: a new row was added"
	else:
		print 'error: %s' % r.status
else:
	sys.exit("usage: getPoint.py [query]")

This Python script takes a single query (potentially with spaces) as its arguments and sends it to the Google Geocoding Web Service. It gets back the result, parses it, and puts it in a database table called google_geocoding, which I use later. The table has four columns: a unique id, the query string, the point geometry column, and the timestamp of its last update (I made my table unique on the query string).
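For reference, the table can be created with something like this (the query, fetchdate and point column names come from the script above; the id column and exact types are my guesses, and it uses the same pg module and PostGIS AddGeometryColumn as the script):

import pg

pgconn = pg.connect('YOUR_DB', '127.0.0.1', 5432, None, None,
	'YOUR_DBUSER', 'YOUR_PASSWORD')
pgconn.query("""
	CREATE TABLE google_geocoding (
		id        serial PRIMARY KEY,
		query     varchar NOT NULL UNIQUE,
		fetchdate timestamp
	)""")
pgconn.query("SELECT AddGeometryColumn('google_geocoding', 'point', 4326, 'POINT', 2)")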

In practical use of this script, on the command line, I read the strings to query from a text file, line by line, and pass each line to the Python script. Here’s an improvised loop:

while read line
do
	echo "$line"
	python ./geocode.py $line
done < text_file_with_strings_to_query.txt

(Reading line by line keeps multi-word queries together; $line is left unquoted on purpose so the words arrive as separate arguments, which the script then joins back.)