Gephi, like “Photoshop” for graphs

I stumbled across Gephi while reading about the methodology Jonathan Stray used for a project he did at AP on the Iraq and Afghanistan war logs.

Gephi is a tool developed primarily in France and co-sponsored by SciencesPo’s médialab, a research center strongly inspired by MIT’s own Media Lab.

I tried it out today. It is surprisingly fast and seems to provide what we need to do graph analysis. I think the next step is to produce data in the format that Gephi reads natively, GEXF, which is XML-based. The data is defined as nodes and edges. The various plugins written for Gephi may help you lay out the graph into something interesting for further analysis, and Gephi lets you annotate the graph and do things like color nodes based on their distance from each other.
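To give an idea of what producing GEXF could look like, here is a minimal sketch in Python using only the standard library. The function name build_gexf and the sample nodes are made up for illustration, and I am assuming the GEXF 1.2 draft namespace and a static, directed graph; this is not the export code we actually use.

```python
import xml.etree.ElementTree as ET

# Assumption: we target the GEXF 1.2 draft schema.
GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(nodes, edges):
    """Return a GEXF document as a string.

    nodes: iterable of (node_id, label) pairs
    edges: iterable of (source_id, target_id) pairs
    """
    # Serialize GEXF elements in the default namespace (no prefix).
    ET.register_namespace("", GEXF_NS)
    gexf = ET.Element(f"{{{GEXF_NS}}}gexf", {"version": "1.2"})
    graph = ET.SubElement(
        gexf,
        f"{{{GEXF_NS}}}graph",
        {"mode": "static", "defaultedgetype": "directed"},
    )

    nodes_el = ET.SubElement(graph, f"{{{GEXF_NS}}}nodes")
    for node_id, label in nodes:
        ET.SubElement(
            nodes_el, f"{{{GEXF_NS}}}node", {"id": str(node_id), "label": label}
        )

    edges_el = ET.SubElement(graph, f"{{{GEXF_NS}}}edges")
    for index, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            f"{{{GEXF_NS}}}edge",
            {"id": str(index), "source": str(source), "target": str(target)},
        )

    return ET.tostring(gexf, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical two-node graph: one user reposting another.
    print(build_gexf([("a", "Alice"), ("b", "Bob")], [("a", "b")]))
```

Writing the returned string to a file with a .gexf extension should be enough to try opening it in Gephi.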

The screenshots and videos should help convince you of the usefulness of the tool for your organization. Since the program is written in Java, it is available on all platforms: Windows, Mac and even Linux.

Next step, read the manual


The (supposed) gender imbalance on the Chinese Weibosphere (updated)

Gender divide on Sina Weibo in Hong Kong

In November, our Hong Kong sample had a larger majority of women than the mainland sample. The current data shows 72753 women for 27871 men (about 72% women), which is more or less the same proportion, give or take a few percentage points. I was starting to suspect that this imbalance might be linked to our sampling method, which is based on a sample of posts, for which we grab the users: if women are more frequent updaters, they should appear more frequently, no matter their real distribution in the Weibo population. However, some data that I found may indicate that it is not an accident.
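The suspected sampling bias can be put into numbers with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name and the example figures are made up, and it assumes that the chance of a user landing in our sample is simply proportional to how often that user posts.

```python
def share_in_post_sample(population_share, relative_posting_rate):
    """Expected share of a group among users picked up through their posts.

    population_share: fraction of all users who belong to the group (0 to 1)
    relative_posting_rate: how often a group member posts compared with
        a user outside the group (1.0 means the same rate)
    """
    weighted_group = population_share * relative_posting_rate
    weighted_rest = 1.0 - population_share
    return weighted_group / (weighted_group + weighted_rest)


if __name__ == "__main__":
    # Hypothetical case: women are half of all users but post twice as often.
    print(round(share_in_post_sample(0.5, 2.0), 3))  # 0.667
```

In other words, under these assumptions a perfectly balanced population could still show up as two-thirds women if women posted twice as often, which is why the sampling method alone could produce a skew.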

To my great surprise, an analysis method that we developed in the past week seems to support this gender imbalance. I now have a bunch of new scripts that can gather all the reposts of a single post (in Weibo, all reposts/retweets are posts of their own, with a reference to the original) and gather basic information on the users. With previous methods, we only had a cross-section of the user base, but this time we have a clearly defined and complete sample of users. We compiled the data for a popular non-entertainment post by a Hong Kong activist on the Choi Yuen Tsuen (菜園村) issue (login required) in Hong Kong, and found that it was reposted by 1050 women and 293 men (in Weibo, the gender profile attribute is mandatory and exposed by the API). And 1050 / (1050 + 293) is… 0.782, which means that 78.2% of the reposters are women.
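The counting step can be sketched as follows. This is not our actual script: the record layout (a dict with user_id and gender keys) and the "f"/"m" values are assumptions for illustration, and each user is counted once even if they reposted several times.

```python
from collections import Counter


def gender_counts(reposts):
    """Count unique reposting users by gender.

    reposts: iterable of dicts with the assumed keys 'user_id' and 'gender'.
    """
    gender_by_user = {}
    for repost in reposts:
        # Keep the first gender seen for each user so repeat reposters
        # are counted only once.
        gender_by_user.setdefault(repost["user_id"], repost["gender"])
    return Counter(gender_by_user.values())


def share_of(counts, key):
    """Return the fraction of the total that falls under `key`."""
    total = sum(counts.values())
    return counts[key] / total if total else 0.0


if __name__ == "__main__":
    # The counts reported above for the Hong Kong reposters.
    counts = Counter({"f": 1050, "m": 293})
    print(round(share_of(counts, "f"), 3))  # 0.782
```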

In the entire sample, Hong Kong and other territories combined, the numbers are 6,211 women and 1,814 men, which is 77.4% women [strangely enough for a Hong Kong issue, the majority of reposters were mainland-based, with the most people from Guangzhou (1512) and Shenzhen (411)]. To give more power to these observations on gender, I’d have to look at a larger variety of posts, and actually verify the validity of the gender identification. It would be interesting to see whether the users who repost are also people who post a lot of original things, and whether they are followed by a comparable number of users. Do women make up most of the user base of Sina Weibo? Does the repost pattern reflect their style of contributions?


More users in the DB, but still about the same number of active users: Sina Weibo in Hong Kong

Our crawling of users in Sina Weibo is going well, but the number of active users stays relatively the same. To explain how we do the crawling: to get a list of fresh users who are active (a very important criterion), we use the statuses/search function of the Sina Weibo API to fetch the latest 200 posts as of the present time (200 is the maximum for Sina Weibo, but it is much larger for Twitter, namely 3,200 posts, or 16 pages of 200 posts). We specify province=81 (Hong Kong) and store any user found in the search (which picks up some non-HKers, in the case of reposts). We are now about 2,000 short of the magic mark of 100,000 users followed.
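The "store any user found in the search" step could look something like the sketch below. The shape of the search results is an assumption: I am pretending each post is a dict with a nested user dict, and that a repost carries the original post under a retweeted_status key. The function name is made up, and the actual API call is left out on purpose.

```python
def collect_users(statuses):
    """Return {user_id: user} for every user seen in a batch of posts.

    statuses: list of dicts, each with an assumed nested 'user' dict that
        has an 'id'. For reposts, the original post is assumed to sit under
        'retweeted_status', which is how non-HK users can end up in the list.
    """
    users = {}
    for status in statuses:
        # Look at the post itself, then at the original post if it is a repost.
        for source in (status, status.get("retweeted_status") or {}):
            user = source.get("user")
            if user and "id" in user:
                users.setdefault(user["id"], user)
    return users


if __name__ == "__main__":
    # Made-up batch: user 2 reposts user 3, and user 1 appears twice.
    batch = [
        {"user": {"id": 1}},
        {"user": {"id": 2}, "retweeted_status": {"user": {"id": 3}}},
        {"user": {"id": 1}},
    ]
    print(sorted(collect_users(batch)))  # [1, 2, 3]
```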

Another script, run as a daily cronjob, updates a list of Sina Weibo Hong Kong users to “check”, which consists of downloading their latest posts using statuses/user_timeline. It takes about 36-48 hours to go through the entire list of 100,000 users, depending on which other jobs are querying the Sina Weibo API at the same time for other purposes. We figured we could classify users and make the updating more selective, although that remains to be coded.
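To put the 36-48 hours in perspective, here is the quick arithmetic, assuming for simplicity that one pass means exactly one timeline request per user and that the list holds a round 100,000 users.

```python
def crawl_rate(users, hours):
    """Return (users per hour, users per second) for one full pass."""
    per_hour = users / hours
    return per_hour, per_hour / 3600


if __name__ == "__main__":
    for hours in (36, 48):
        per_hour, per_second = crawl_rate(100_000, hours)
        print(f"{hours} h: {per_hour:,.0f} users/hour, {per_second:.2f} users/second")
```

That works out to roughly 2,100 to 2,800 users per hour, or well under one user per second, which is why a more selective update list looks attractive.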

One interesting finding is that while the number of users in our database increases, the number of active users does not. If we get the latest 200 posts at the first retrieval for a new user, that means that for certain users the accessible timeline may stretch back quite a bit, while for others it may not. It hasn’t been characterized yet, so I can’t tell for sure. Also, the crawling of users is based on a sample from the API, presumably “random” with respect to how users update their microblogs.

I need to gather more information, and what I find “bizarre” may not be bizarre at all if, say, getting the latest 200 posts evens out over the period of two and a half months covered in this graph. But if our sampling picked up a lot of new users without older posts, then maybe they are spam accounts, or they are just very infrequent posters who mainly read.


The gender imbalance on the Chinese Weibosphere

Gender divide on Sina Weibo in Hong Kong (n=53,821)
3 Sina Weibo users out of 4 in Hong Kong are women

Gender divide on Sina Weibo across the network (n=539,274)
57% of Sina Weibo users (China+World) are women, just like it is on Twitter, actually

“Sina Weibo users living in HK: 40,268 women女, but only 13,553 men男? Around the world: 307,916女, 231,358男. Is the API or my sample wrong?”

– From my personal Twitter

There might be more men than women in China, but that is just not the case on the Sina Weibo online social network. Just like location, gender is a required field when you sign up for a Sina Weibo account, and while doing a simple database query the other day, I found this interesting statistic: 3 Weibo users out of 4 in Hong Kong were women, and the proportion of women on Weibo across the world is 57%, against 43% for men, a difference of 14 percentage points.

We collected data on the users using Sina’s public search API, so these are users who were active over the course of about a month. While the Hong Kong sample seemed mysteriously skewed, the China+World sample is much closer to what one would expect, at 57% women versus 43% men, which is consistent with data for Twitter.

The China+World sample consisted of a big Hong Kong sample (the focus of our main research) and of the 400 or more most popular users by followers in each province/city of China and world regions (a lot more for big Chinese cities like Beijing, Shanghai, Chongqing, etc.). Even if we don’t count the Hong Kong sample of about 54,000 users, the gender imbalance is still noteworthy: 267,648女Women (55.13%), 217,805男Men (44.87%).
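The figures without Hong Kong come from a simple subtraction of the two samples quoted in the tweet above. As a quick check, with the dictionary names being my own:

```python
# Counts quoted above: the full China+World sample and its Hong Kong subset.
world = {"women": 307_916, "men": 231_358}
hong_kong = {"women": 40_268, "men": 13_553}

# Remove the Hong Kong users from the full sample.
rest = {gender: world[gender] - hong_kong[gender] for gender in world}
total = sum(rest.values())
shares = {gender: round(100 * count / total, 2) for gender, count in rest.items()}

print(rest)    # {'women': 267648, 'men': 217805}
print(shares)  # {'women': 55.13, 'men': 44.87}
```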

David McCandless of Information is Beautiful did a nice graphic of the gender imbalance in Western online social media.


Who are the most popular Sina Weiboers?

As of Oct 29 2010, the answer lies here: http://tables.googlelabs.com/DataSource?dsrcid=292639. With the techniques described in another blog post, I’ve done pulls from the Sina Weibo API by province and city, in order of follower count. I made sure that each region (province code alone or province + city codes) had users with as few as 100 followers. For the sake of simplicity, I made a summary of the 100 most followed users on Sina Weibo.

Sina Weibo Top 100 Most Followed Users (including administrative accounts):

Link to Top 100 + Entire data set (20,000+ rows)

Sina, the portal, actually assembled a China Daily feature published on June 17 2010 describing the state of the Weibosphere. In comparing Twitter with Sina Weibo, the article quoted iResearch.cn blogger Liu Xingliang as saying that Sina was, as one would expect, far less political than its American counterpart. Sina Weibo also had two business people in its Top 10, which Twitter did not have.

The Top 10 changed a little bit, but still saw mainland actress Yao Chen (姚晨) with about 3.3 million followers sitting at the very top, followed by Taiwanese actress Dee Hsu (徐熙娣) and mainland-born Zhao Wei (赵薇).

We suspect that some of these numbers may be a little influenced by this celebrity suggestion directory (has an “add-all” button at the top-right of each block).

Don’t look for Sina Weibo pages from JMSC staff: they are all on QQ’s microblog system. CMP director Qian Gang for instance had close to 1.8 million followers as of yesterday.


A new tweet-gathering tool + sample with #liuxiaobo and 刘晓波 as search words

We’ve developed a simple tool that uses the Twitter API to collect tweets into our local database at the JMSC. We initially selected a list of 1000 China-based tweeters, removed those with private profiles (which are impossible to gather), and proceeded to download their last 3200 tweets (the maximum allowed by the API), starting a few weeks ago. For some, it meant tweets as old as 2008, and in most cases at least well into 2009. And since this week, I’ve started to continuously collect the tweets four times a day.
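Downloading the last 3200 tweets means paging through a timeline until the cap is reached or the user runs out of tweets. Here is a rough sketch of that loop, not our actual tool: fetch_page is a hypothetical callable standing in for whatever makes the real API request, and the page size of 200 is an assumption.

```python
def fetch_timeline(fetch_page, page_size=200, max_tweets=3200):
    """Collect up to max_tweets items by requesting page after page.

    fetch_page: hypothetical callable (page_number, page_size) -> list of
        tweets, newest first; it returns an empty list when nothing is left.
    """
    tweets = []
    page = 1
    while len(tweets) < max_tweets:
        batch = fetch_page(page, page_size)
        if not batch:
            break  # The user has fewer tweets than the cap.
        tweets.extend(batch)
        page += 1
    return tweets[:max_tweets]


if __name__ == "__main__":
    # Fake data source: a user with 5000 numbered tweets.
    def fake_fetch(page, count):
        start = (page - 1) * count
        return list(range(start, min(start + count, 5000)))

    print(len(fetch_timeline(fake_fetch)))  # 3200
```

Passing the fetching function in as a parameter keeps the paging logic testable without touching the network.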

The tool has a web interface, but it is not yet ready to be released to the public because of load issues and other unresolved questions. Twitter is notoriously bad at providing meaningful search, because of the large volume of users. But since we are keeping track of only about 1000 tweeters, most of them from a list of “influential” tweeters, we hope that we can make more sense of this particular (and authoritative) slice of the Chinese Twittersphere. We hope to eventually have data on a more “popular” Hong Kong Twittersphere, the focus of our research (the problem is how to select a sample of the general public, which makes this more difficult).

For now, I can provide CSV files of individual search queries if you send an e-mail to me at cedsam@hku.hk. The files may look like the following CSV files with #liuxiaobo and 刘晓波 as search words.