The year 2012 bug

In case people noticed, the quality of our WeiboScope declined quite a bit towards the end of last week. This was simply caused by the passage to the first ISO week of 2012 (which started on Monday). As a result, only the most popular posts made in 2011 were counted. We didn’t lose anything, and things are back on track.
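The exact place where the week boundary tripped our ranking is not detailed here, so the following is only a minimal sketch of the general pitfall: the ISO year and the calendar year disagree around New Year. The function names are hypothetical and are not taken from the WeiboScope code.

```python
from datetime import date

def iso_week_key(d):
    """Return an (ISO year, ISO week) pair for a date."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return (iso_year, iso_week)

def naive_week_key(d):
    """Pair the calendar year with the ISO week number (the mismatch to avoid)."""
    return (d.year, d.isocalendar()[1])

new_year = date(2012, 1, 1)      # a Sunday, still in the last ISO week of 2011
first_monday = date(2012, 1, 2)  # first day of ISO week 1 of 2012

print(iso_week_key(new_year), naive_week_key(new_year))          # (2011, 52) (2012, 52)
print(iso_week_key(first_monday), naive_week_key(first_monday))  # (2012, 1) (2012, 1)
```

Grouping posts by the ISO pair keeps January 1st with the last week of 2011, whereas mixing the calendar year with the ISO week number files it under a week that does not exist yet.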

Featured today on WeiboScope:
The rumoured coup in North Korea
36 years since the death of Zhou Enlai
Some corrupt officials at the D & G in Hong Kong?


WeiboScope: image search by keyword

weibosearch

You might have heard of WeiboScope for its display of the most important images posted by a sample of users we selected. “So what?”, some users have asked. WeiboScope is a suite of visualisation tools for an archive of Sina Weibo posts that we collect and store in a local database, which currently grows by some 2-3 million posts per week.

But the power of WeiboScope lies not in this particular visualisation (there are many of them), but in the data underneath that sustains it. Rather than let Sina Weibo dictate the way the data produced by users should be displayed, we borrow a bit from the open data movement and repackage posts in ways that may be a bit more useful to users. This is how the WeiboScope image search came to be.

Consider these current use case scenarios:

1- A non-Chinese reader would like to know what the Chinese Weibosphere is now thinking about the death of Kim Jong-il. They can decide to type the Chinese name of Kim Jong-un in the search bar on Weibo.com and find a list of about 25 weibos. But because they are unable to read Chinese, they can rely only on images. They feel lost, and give up on Weibo (for the day).

2- A person who has a native level of Chinese is doing research on suicide. Some cases reportedly go viral on the Internet, sometimes because they are fake and attention-seeking, sometimes because their causes provoke deep societal debates. The researcher uses the search bar on Weibo.com and finds some irony and some irrelevant news. It is hard for the researcher to assess the importance of one such case relative to others within a certain period of time.

Now, suppose that we had a sample of all weibos ever produced and that our search engine were neutral as to what gets shown and what does not.

Scenario 1: Using the image search on WeiboScope, you can now find that one of the most popular images used in posts was this one. But then, by visual elimination, you may also notice some more odd pictures such as this one speculating on the younger Kim’s Christmas activities.

Scenario 2: Using the image search on WeiboScope, the researcher searches for the word “suicide”. In March 2011, we tried this with an early version of this tool. We had first heard of a schoolchildren suicide case in Fujian through the popular image aggregation. At that point, we had only seen one post that made it to viral level. We were curious about the impact of this case on the Chinese Internet, so we searched for the characters for “suicide” on the search engine. The result? About 80% of the recent posts with the characters for “suicide” were related to the Fujian case.

The WeiboScope image search demonstrates that when you are allowed to mash, mix and remix data, you may make discoveries that would not have been possible otherwise.

http://research.jmsc.hku.hk/social/obs.py/sinaweibo/#search

(For non-Chinese writers, the engine supports some automated Google Translate translation! For people searching in Chinese characters, please use quotes around your characters.)


Google+ API crawler in Python and a few remarks to start with

We’ve started working on tools to crawl the newly released Google+ API. I got an e-mail notifying us of the availability of the API on September 16th. I think we’re the first ones to write third-party tools to download and cache some of the data.

I’ll post the database schemas later when they’re more stable.

For now, the API is read-only, and we’re limited to 1,000 requests per day. Since it is a first release, I was keen to start collecting, in case the terms change.
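With a quota this small, a crawler has to count its calls. Here is a minimal sketch of a daily budget counter; the class name is hypothetical, and the assumption that the quota resets when the local date changes is ours, not something stated by the API terms.

```python
from datetime import date

class DailyBudget:
    """Track how many API calls remain for the current day."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_spend(self, today=None):
        """Record one request; return False if today's quota is exhausted."""
        today = today or date.today()
        if today != self.day:
            # Assumed reset rule: a new local date starts a fresh quota.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

A crawler loop would call `try_spend()` before each request and go to sleep until the next day once it returns False.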

The API is interestingly minimalistic: People, Activities and Comments are the three data types you can search, list and get. There are many other types of data, but they are attached to these three. For instance, a People resource can have several organisations, urls, placesLived and emails, although I don’t think the latter is available with the current version of the API.

As far as People are concerned, you may also get a hasApp (for the mobile app, we guess), languagesSpoken (an array of strings) and even an intriguing currentLocation (Latitude/Maps integration, someone?). It’s interesting, but it’s also scary, from a user’s point of view, how much publicly accessible information there is.
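To get a feel for how much a profile exposes, one could list which of these fields come back filled in. The sketch below assumes a person record already decoded from JSON into a dict; the field names are copied from the text above and have not been checked against the API reference, and the sample record is invented.

```python
import json

# Field names as written in this post (unverified against the API reference).
PUBLIC_FIELDS = ("organisations", "urls", "placesLived", "emails",
                 "hasApp", "languagesSpoken", "currentLocation")

def exposed_fields(person, fields=PUBLIC_FIELDS):
    """Return the fields that are present and not empty in a person dict."""
    empty = (None, [], "", {})
    return {name: person[name] for name in fields
            if name in person and person[name] not in empty}

# Invented sample record, for illustration only.
sample = json.loads("""
{"displayName": "Example User",
 "languagesSpoken": ["en", "zh"],
 "hasApp": true,
 "urls": [],
 "currentLocation": "Hong Kong"}
""")

print(sorted(exposed_fields(sample)))  # ['currentLocation', 'hasApp', 'languagesSpoken']
```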


The trouble with popular users…

Screenshot at 2011-11-17 09:47:06

At some point in our research project, it was a good idea to take all the users with more than a certain arbitrarily large number of followers (say, 1000), download their posts and analyze them. This doesn’t always seem to be the case anymore. Results vary from day to day.

We are set to release WeiboSphere, but will wait a little before pushing it. Right now, we’re taking every user with 1000 or more followers and getting all their recent posts from the API. We aggregate them and produce an unfiltered (at least not with a human filter) ranking of the most popular posts over the last 24 hours, 48 hours, week, two weeks and month.
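The aggregation described above can be sketched roughly as follows. This is not our production code: the dict keys (`followers`, `reposts`, `created_at`), the use of repost counts as the popularity measure and the 30-day month are all assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOWS = {
    "24 hours": timedelta(hours=24),
    "48 hours": timedelta(hours=48),
    "week": timedelta(weeks=1),
    "two weeks": timedelta(weeks=2),
    "month": timedelta(days=30),  # assumed: a month approximated as 30 days
}

def top_posts(posts, now, top_n=10, min_followers=1000):
    """Rank posts by repost count within each time window."""
    # Keep only posts whose author passes the follower threshold.
    eligible = [p for p in posts if p["followers"] >= min_followers]
    rankings = {}
    for label, span in WINDOWS.items():
        recent = [p for p in eligible if now - p["created_at"] <= span]
        recent.sort(key=lambda p: p["reposts"], reverse=True)
        rankings[label] = recent[:top_n]
    return rankings
```

Nothing in this ranking looks at the content of a post, which is why shoes and celebrities can easily crowd out everything else.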

Alas, in the last two days, all we’re seeing are female body parts, shoes and celebrities who returned to an incredibly thin size after a pregnancy.

The hope for now, until we improve the filters, is that we can see posts such as this one on an abducted girl in Guangzhou, posted yesterday morning.


Spawn more overlords?

Lucene -> Daemon

One of the biggest challenges in the project has been I/O. Across the networks that we monitor, we deal with large amounts of data that we need to write and read at every moment.

Lucene is a quick way to search through text, including text in Chinese. We used to rely on the database to do this, but it quickly turned out to be terribly inefficient. To do a search, you had to visit every row (within the parameters given) and check whether a term appeared.

We asked our HKU colleagues in the computer science department for help, namely Reza Sherkat, a former IBM employee, now a post-doctoral fellow with Nikos Mamoulis. He had previously given us advice on inverted indexes, which in a nutshell use tokens of text (from the weibos, say) as keys in a gigantic array. The value stored under each key is the list of identifiers of the rows or objects being indexed that contain the token (for weibos, that would be the weibo IDs).

So, when you search a word, you effectively only go through a list of unique words/tokens, which returns a bunch of weibo IDs.
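A toy version of the idea, in a few lines of Python. It is only a sketch of the data structure, not of Lucene itself: the whitespace tokenizer is an assumption that would not work for Chinese, where a word segmenter or n-gram analyzer is needed, and the sample weibos are invented.

```python
from collections import defaultdict

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()

def build_index(weibos):
    """Map each token to the set of weibo IDs whose text contains it."""
    index = defaultdict(set)
    for weibo_id, text in weibos.items():
        for token in tokenize(text):
            index[token].add(weibo_id)
    return index

def search(index, term):
    """Look up one term directly; no scan over the posts is needed."""
    return sorted(index.get(term.lower(), set()))

# Invented sample data, for illustration only.
weibos = {
    101: "breaking news from Fujian",
    102: "weather news for Beijing",
    103: "shoes and celebrities",
}
index = build_index(weibos)
print(search(index, "news"))  # [101, 102]
```

The cost of a lookup depends on the number of distinct tokens and matching IDs, not on the number of rows in the table.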

The second trick Reza told us about was the use of programs running in the background, or what are commonly called “daemons”. Like daemons, they are always there, waiting for a program to call them. One use we could make (or should make) of them would be to keep a list of user IDs in memory. If you want to know whether a weibo was made by a given user, there is no need to go to the database to check. You can do all of that in memory.
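As a rough illustration of the in-memory part only (the long-running process and the way other programs would talk to it are left out), here is a sketch using a Python set. The class and function names are hypothetical.

```python
class UserIdCache:
    """Hold a set of user IDs in memory for constant-time membership checks."""

    def __init__(self, user_ids):
        # Loaded once, e.g. from the database when the daemon starts.
        self._ids = set(user_ids)

    def add(self, user_id):
        self._ids.add(user_id)

    def __contains__(self, user_id):
        return user_id in self._ids

    def __len__(self):
        return len(self._ids)

def posts_by_known_users(posts, cache):
    """Keep only the posts whose author is in the cache, without any database query."""
    return [p for p in posts if p["user_id"] in cache]
```

A real daemon would keep one such object alive and answer lookups from the crawling scripts, for instance over a local socket.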

There are probably some more clever uses, such as for counting or going through large numbers of items.

It is said, for instance, that Google and Facebook achieve their levels of efficiency by keeping all the data that passes through in memory. The problem with memory is that it requires an electric current to stay alive. One power outage (which we think should never happen) and the data dies.

Operating in memory (in RAM, that is) is much, much faster than having to fetch from a disk. It should make a difference, and we shall try it on our 48 GB of RAM.


New attributes in Sina Weibo API’s V2

Sina released a second version of their API about a month ago. It’s worth mentioning that we have moved some of our crawling scripts over to V2.

Of interest to us, the status entity now has the following new attributes:

  • reposts_count
  • comments_count
  • mlevel

The first two are self-explanatory. The third probably means “maturity level”. We’re happy to get the first two and think it was about time that Sina started giving us exact numbers. In their defense, repost (and comment) numbers on Sina Weibo are much, much higher than on Twitter, because status entities are much better preserved on Sina (on Twitter, those attention hoggers just keep re-writing posts to include their names).

The user entity also appears to have gained new attributes:

  • allow_all_comment
  • avatar_large
  • verified_reason
  • bi_followers_count
  • verified_type
  • lang

It’s to be noted that the last two, verified_type and lang, were not documented yet, and I saw them just this afternoon (and promptly updated our scripts to reflect them). They are self-explanatory. Unlike on Twitter, Sina verification can be of several levels. My Weibo account is “verified”, but just because I was verified to be a JMSC employee (not because I am famous, bah). There are corporate accounts that get a different kind of verification, and there are probably others that I’m not certain about (power users?). “lang” is very interesting. We have mobile clients in English; we have Web interfaces in Traditional (for Taiwan) and Simplified Chinese (for the mainland). So, is Sina really preparing international versions?
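Since some of these attributes are undocumented and may come and go, a script reading them should tolerate their absence. The following is a minimal sketch, assuming a status already decoded from JSON and assuming that the second list of attributes sits on a user object nested under a `user` key; the function name and the sample values are invented.

```python
def summarize_status(status):
    """Pull the V2 attributes discussed above out of a decoded status dict."""
    user = status.get("user") or {}  # assumed nesting of the author's profile
    return {
        "reposts": int(status.get("reposts_count", 0)),
        "comments": int(status.get("comments_count", 0)),
        "mlevel": status.get("mlevel"),
        # Undocumented at the time of writing, so they may be missing.
        "verified_type": user.get("verified_type"),
        "lang": user.get("lang"),
    }

# Invented sample, for illustration only.
sample = {
    "reposts_count": 120,
    "comments_count": 45,
    "mlevel": 0,
    "user": {"verified_type": 0, "lang": "zh-cn"},
}
print(summarize_status(sample))
```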

***

After several e-mails from the audience, we do acknowledge that we also faced some auth problems, but we were lucky to have started the project early, so that we don’t face some of the other problems (such as the need to specify in the devapp whether we are foreigners). I don’t know if it has anything to do with the rate limits you end up getting.

We’ve also had a few problems with the friendship (friends / followers) functions in V1. Those are still there in V2. Namely, these functions won’t work with just OAuth. You also have to have cookies (thus a Web browser accessing the API URL, while signed in to your Sina account). If you see some inconsistencies, feel free to e-mail us.

We’re submitting our social media project final report this week. So, expect us to release the tools we developed into the wild in the next weeks (not months, I hope). Some of them are already up on our GitHub.


What an inconsistent API…

The Sina Weibo API will not always return what you expect. The web interface is one way to access the data posted on Weibo, but the API (application programming interface) is what programmers and applications talk to. If a post is gone (for whatever reason) from the Weibo website, is it also really gone from Weibo’s databases, and gone if you try to access it through the API?

Take the example of post #3351528048407216, made by a user called 北京徐晓, an author living in Beijing. It was posted on Monday night, just past midnight.

The microblogger wrote: “太好玩了,党报《光明日报》网站发表文章,抨击骆家辉轻车简从的背后,是资本主义及西方价值观的渗透,是美国的“新殖民主义”、“文化殖民主义”的体现。恼羞成怒挂不住了不如直接说,何必这么牛头不对马嘴的瞎拽呢?”. It google-translates to “Too much fun, the party newspaper “Guangming Daily” Web site published an article criticizing Locke pomp behind the Western values of capitalism and the infiltration of America’s “new colonialism”, “cultural colonialism” is all about. Angry embarrassing as a direct say, irrelevant of the blind so why pull it?”

The article (our snapshot) was one of the most popular in the last 24 hours. Traces of the post cannot be found on the user’s timeline (see screenshot): there is now a gap between 00:37 and 00:48, whereas the post was made at 00:46 on August 29th.

The following screenshot shows how it now appears on the page of one of the users (we counted 27,536 posts in our archive so far) who reposted the post in the meantime:

Screenshot-天涯赵瑜的微博 新浪微博-隨時隨地分享身邊的新鮮事 - Google Chrome

The message on the website is “該微博已被刪除”, which is “This Weibo has already been deleted” (example here). It’s different from the message “此微博已被原作者刪除”, which is “This Weibo has been deleted by the user” (example here), and which may also appear on your timeline when a post you reposted was deleted by the original user (but your message remains intact).

What happens now if you take the ID of the post (3351528048407216) and query it against the API (link, may not work if not logged in to Weibo)? You find that the post is still accessible from a programmer’s standpoint:

EDIT 2012-02-02: It seems like Sina has changed the deleted posts error message. On the normal website, self-deleted and presumably system-deleted posts are now indistinguishable. But if you look at deleted posts through the API (using the statuses/show function), they definitely are not: a self-deleted post says “weibo does not exist” and a system-deleted post says “permission denied”. We have just started investigating different deleted posts through a fully automated method.

Screenshot-api.t.sina.com.cn-statuses-show-3351528048407216.json?source=4280451947 - Google Chrome
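The distinction noted in the EDIT above lends itself to a simple automated check. This is a minimal sketch, not our actual method: it only matches the two messages quoted in the text, and how the error message is obtained from the statuses/show response is left to the caller, since the exact response format is not described here.

```python
def classify_deletion(error_message):
    """Guess how a post disappeared from the error text returned by the API."""
    text = (error_message or "").lower()
    if "weibo does not exist" in text:
        return "self-deleted"
    if "permission denied" in text:
        # Presumably removed by the system rather than by its author.
        return "system-deleted"
    return "unknown"

print(classify_deletion("Permission Denied!"))
```

Anything that matches neither message is reported as unknown rather than guessed at, since Sina has already changed these messages once.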


List of all Sina Weibo users with 1000+ followers

I compiled an updated list of all users with 1000 or more followers on Sina Weibo:
http://www.google.com/fusiontables/DataSource?dsrcid=1170398

The list is hosted on Google Fusion Tables. It is sorted by number of followers, but you could re-sort by whichever column. You may also filter and search. There are currently 108,341 users across the network with 1000+ followers.


Google+ in Greater China

This morning, I stumbled upon a website listing the users with the largest following (in circles) so far on Google+. Mark Zuckerberg is at the top, with some 35,000 now, followed by a series of top Google execs such as Larry and Sergey.

The Top 50 contains many American techies and celebrities, but also a sizeable complement of Chinese Internet notables. The highest ranked Chinese celebrity is a blogger and software engineer named William Long (月光博客) at #19. He is trailed by Valen Hsu (許茹芸), a Taiwanese singer, who has been circled by 5,000 people so far, good for #20.

Screenshot-許茹芸 - Google+ - Google Chrome
Taiwanese singer Valen Hsu is 20th among the most circled users on Google+

Prominent blogger Hecaitou (和菜头) is currently at #33. Another Chinese blogger, keso.me, is #45. Dahui Feng, a well-known tech commentator, is #49.

In the meantime, reports of Google+’s death in China may have been greatly exaggerated.


You’ve got mail! (from the Sina Weibo admins)

Screenshot-我的通知 新浪微博-隨時隨地分享身邊的新鮮事 - Google Chrome

We seldom get internal mail from Sina Weibo, but it happened twice on the same day, this Tuesday. What was it for? The first was apparently to dispel rumours of nuclear winds from the Fukushima I nuclear accidents spreading to the Philippines, presumably before they were to reach China (the rumours, not the radiation), and the second was a forecast of the weather in regions of Japan potentially affected by the nuclear accident.