China News Archive

Ever looked for an automated archive of Chinese news websites? For months, we’ve been collecting screenshots and HTML snapshots of up to 20 websites based in China or covering China. We now have a webpage for it.

The screenshots are classified by news source and with a minimalistic (if not just minimal) interface, organised by day and regrouped by month. For instance, you could go to the QQ News archive, an archive for February 2011, and a particular link to today.

There’s also a version for accessing navigeable HTML pages when available.

Spam spam spamalot

I’ll start filtering weibos that contain links with and It seems like most containing such links go to spam-like galleries. Bad, bad, mega-bad. The amount of spam is just ridiculous today.

How do you catch and archive deleted posts on Sina Weibo?

It’s the holy grail of any media researcher working on China: how do you quantify contents removal from social media services such as Sina Weibo? Here at JMSC, we’ve been developing tools to scour and assess social media of all sorts for the purpose of researching online media in Hong Kong and mainland China. (This is the same project that generated WeiboScope, for those tuned in.)

Screenshot at 2012-02-08 11:56:04

With the extensive archive of weibos we’ve accumulated so far and mechanisms underlying its retrieval, we were able to develop a routine that finds and marks deleted posts (method explained here).

The result is an archive of deleted posts (and the CMP’s Anti-Social List). Not only is it possible to find a large number (which is not exhaustive, we admit) of such posts within the day of their doom time, but also be able to presume with clear-cut evidence whether these deleted posts were removed by the user itself or simply deleted by system managers (we first noticed the difference in August 2011…).

As described in a post last week, the idea behind this archive is simple and straightforward to implement, once you’ve got the infrastructure.

A previous copy of the user timeline, containing all posts

A current copy of the user timeline, with a missing post

Both copies of a user timeline (post IDs extracted from the full JSON response) are obtained during two consecutive API calls, which may span a few minutes or several hours. The smaller this interval between pollings, the more precise would the routine be in finding the exact removal time (and the chances of missing something, smaller too).

A post is found to be deleted when you could see it in the previous version, but not the current one. Since we keep a copy of every post we see, we simply mark this post, and can then view them all in a custom webpage. Easy enough, right?

Screenshot at 2012-02-08 12:14:35

Screenshot at 2012-02-08 12:14:26

The previous two images show you exactly what, from a programmer’s point of view, Sina Weibo returns us for two different kinds of deleted posts. The former, with an API response “weibo does not exist”, identifies a post that was presumably deleted by the user. The latter, which returns “permission denied”, is presumably a post deleted by the system.

We don’t know the intention between both types of messages, but we can guess based on what their contents are generally. The first are generally made of spam-like posts that would be deleted on any online social network in the world. The second seem to provide more legitimate contents, including some made by so-called VIP users verified by Sina (we only check a 2500-odd sample of users, so can’t really infer on their representation).

This feature is indeed powerful, because it finally puts a number on post removal on Sina Weibo. (We computer science majors strongly dislike conclusions not based on numbers and data.)

It is however currently impossible to tell with certainty what gets deleted (versus what’s not), since our user sample is strongly biased towards public commentators, and perhaps because the number of posts found is still extremely small.

What it does give is an understanding of how post removal works, how much time it usually takes for something to be removed, and whether reposts of posts or just reposts but not original posts, get deleted (in fact, it happens). It’s a privileged peek, indeed, at what’s going on on the Chinese Internets, right here, right now.

Until we accumulate a substantial archive to do anything useful, our colleagues at China Media Project have started compiling and explaining deleted posts on their Anti-Social List.