Android Ice Cream Sandwich on a Pandaboard

I got myself a Pandaboard a few weeks ago. It’s a community-supported fanless board sponsored by Texas Instruments. It sports their OMAP 4430 chip, the same CPU found in the Kindle Fire, the Motorola Droid RAZR and a bunch of other smartphones (the Galaxy Nexus has its successor, the 4460).

So, when I first got it, I installed Ubuntu Linux (Linaro) on it. I used it very little, and sort of figured out the basics, such as installing an Apache web server, but I finally noticed that it lacked support for things I wanted to try out, such as Google Video Chat (which is not yet available for the ARM architecture, the one commonly found in most smartphones today).

So, I instead followed a YouTube video from the Pandaboard website that said you could install Android 4.0. It turns out you can, by following the instructions (clearer ones can be found on the Web). So, now I have Android 4.0 Ice Cream Sandwich on the board… The next step is to figure out how to get (or just wait for) the Google apps (Gmail, Gtalk, etc.) and support for basic hardware such as video and audio capture.

In terms of media and journalism, there is perhaps some potential to create new ways to interact with information, by plugging in a projector and some sensors to detect human input. In computing power, the Pandaboard is probably as powerful as a top-of-the-line smartphone, yet at a much lower price of US$178 (but then, no flashy touchscreen). The form factor is interesting for embedded systems, which is something I only discovered this year.


Google+ API crawler in Python and a few remarks to start with

We’ve started working on tools to crawl the newly released Google+ API. I got an e-mail notifying us of the availability of the API on September 16th. I think we’re the first ones to write third-party tools to download and cache some of the data.

I’ll post the database schema later, once it’s more stable.

For now, the API is read-only, and we’re limited to 1,000 requests per day. Since this is a first release, I was keen on collecting early, in case the terms changed.

The API is interestingly minimalistic: People, Activities and Comments are the three data types you can search, list and get. There are many other types of data, but they are attached to the aforementioned. For instance, a “People” can have several organizations, urls, placesLived and emails, although I don’t think the latter is available in the current version of the API.
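To give a concrete idea, here is a minimal sketch of a fetch. The API key is a placeholder (you get one from the Google APIs console), the user ID is made up, and the endpoints follow the v1 documentation:

```python
# Minimal sketch of fetching Google+ data over the read-only REST API.
# API_KEY is a placeholder; the user ID below is made up for illustration.
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # from the Google APIs console
BASE = "https://www.googleapis.com/plus/v1"

def fetch(resource):
    """GET a resource from the API and decode the JSON body."""
    url = "{0}/{1}?key={2}".format(BASE, resource, API_KEY)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# A "People" record, with fields such as displayName, organizations,
# urls and placesLived.
person = fetch("people/112233445566778899000")  # made-up user ID
print(person.get("displayName"), person.get("placesLived"))

# The same person's public activities (posts); each activity ID can in
# turn be fed to activities/{id}/comments to list its comments.
feed = fetch("people/112233445566778899000/activities/public")
for activity in feed.get("items", []):
    print(activity["id"], activity.get("title", "")[:60])
```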

As far as People are concerned, you may also get a hasApp (for the mobile app, we guess), languagesSpoken (an array of strings) and even an intriguing currentLocation (Latitude/Maps integration, anyone?). It’s interesting, but it’s also scary, from a user’s point of view, how much publicly accessible information there is.


The trouble with popular users…

At some point in our research project, it seemed like a good idea to take all the users with more than a certain arbitrarily large number of followers (say, 1,000), download their posts and analyze them. That doesn’t always seem to be the case anymore; results vary from day to day.

We are set to release WeiboSphere, but will wait a little before pushing it. Right now, we take every user with 1,000 or more followers and get all their recent posts from the API. We then aggregate them and produce an unfiltered (at least not human-filtered) ranking of the most popular posts over 24 hours, 48 hours, one week, two weeks and one month.
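The windowed ranking itself is straightforward. Below is an illustrative sketch, where each post is a dict with created_at and reposts_count fields standing in for whatever the crawler actually stores:

```python
# Sketch of the windowed "most popular posts" ranking. The post dicts and
# field names (created_at, reposts_count) are illustrative stand-ins for
# the crawler's actual storage.
from datetime import datetime, timedelta

WINDOWS = {
    "24 hours": timedelta(hours=24),
    "48 hours": timedelta(hours=48),
    "one week": timedelta(weeks=1),
    "two weeks": timedelta(weeks=2),
    "one month": timedelta(days=30),
}

def top_posts(posts, window, now=None, limit=20):
    """Return the most-reposted posts created within `window` of `now`."""
    now = now or datetime.utcnow()
    recent = [p for p in posts if now - p["created_at"] <= window]
    return sorted(recent, key=lambda p: p["reposts_count"], reverse=True)[:limit]

# rankings = {name: top_posts(all_posts, delta) for name, delta in WINDOWS.items()}
```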

Alas, in the last two days, all we’re seeing are female body parts, shoes and celebrities who returned to an incredibly thin size after a pregnancy.

The hope for now, until we improve the filters, is that we can still surface posts such as this one, about an abducted girl in Guangzhou, posted yesterday morning.


Spawn more overlords?

Lucene -> Daemon

One of the biggest challenges in the project has been I/O. Across the networks that we monitor, we deal with large amounts of data that we need to write and read at every moment.

Lucene is a quick way to search through text, including Chinese-language text. We used to rely on the database to do this, but it quickly turned out to be terribly inefficient: to run a search, you had to visit every row (within the given parameters) and check whether the term appeared in it.

We asked our HKU colleagues in the computer science department for help, namely Reza Sherkat, a former IBM employee, now a post-doctoral fellow with Nikos Mamoulis. He had previously given us advice on inverted indexes, which, in a nutshell, use tokens of text (from the weibos, say) as keys in a gigantic map. The value under each key is the list of identifiers of the objects being indexed (for weibos, the weibo IDs).

So, when you search for a word, you effectively only look it up in a list of unique words/tokens, which returns a bunch of weibo IDs.
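In Python terms, the idea looks something like the toy sketch below (illustrative only; Lucene’s internals add proper Chinese tokenization, compression and ranking on top of the same principle):

```python
# Toy inverted index: token -> set of weibo IDs. Real engines like Lucene
# add Chinese tokenization, compression and ranking; the lookup principle
# is the same.
from collections import defaultdict

index = defaultdict(set)

def add_weibo(weibo_id, text):
    """Index a weibo under every whitespace-separated token in its text."""
    for token in text.lower().split():
        index[token].add(weibo_id)

def search(term):
    """Return the IDs of all weibos containing the term: one dict lookup."""
    return index.get(term.lower(), set())

add_weibo(101, "abducted girl found in Guangzhou")
add_weibo(102, "new shoes for sale in Guangzhou")
print(search("guangzhou"))  # {101, 102} -- no table scan needed
```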

The second trick Reza told us about was the use of programs running in the background, or what are commonly called “daemons”. Like their namesake, they are always there, waiting for another program to call them. One use we could (or should) make of this would be, for instance, to keep the list of user IDs in memory: if you want to know whether a weibo was made by a known user, there is no need to go to the database to check. You can do all of that in memory.
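A minimal sketch of that idea (illustrative, not our actual code): a long-running process loads the user IDs once at startup and then answers membership queries over a local socket. Here, load_user_ids() is a hypothetical stand-in for a one-off database query:

```python
# Sketch of a tiny membership daemon: load all user IDs into memory once,
# then answer "is this a known user?" queries over a local TCP socket.
import socketserver

def load_user_ids():
    return {1001, 1002, 1003}  # placeholder for SELECT id FROM users

USER_IDS = load_user_ids()

class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        uid = self.rfile.readline().strip()
        known = uid.isdigit() and int(uid) in USER_IDS
        self.wfile.write(b"yes\n" if known else b"no\n")

if __name__ == "__main__":
    # Runs forever in the background, like any daemon.
    socketserver.TCPServer(("127.0.0.1", 9999), Handler).serve_forever()
```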

There are probably more clever uses, such as counting or scanning through large numbers of items.

It is known, for instance, that to achieve their levels of efficiency, Google and Facebook keep much of the data that passes through their systems in memory. The problem with memory is that it requires an electric current to stay alive: one power outage (which we like to think should never happen) and the data dies.

Operating in memory (in RAM, that is) is much, much faster than having to fetch from disk. It should make a difference, and we shall try it on our 48 GB of RAM.


New attributes in Sina Weibo API’s V2

Sina released a second version of their API about a month ago. It’s worth mentioning that we have moved some of our crawling scripts over to V2.

Of interest to us, the status entity now has the following new attributes:

  • reposts_count
  • comments_count
  • mlevel

The first two are self-explanatory. The third probably means “maturity level”. We’re happy to get the first two and think it was about time that Sina started giving us exact numbers. In their defense, repost (and comment) numbers on Sina Weibo are much, much higher than on Twitter, because status entities are much better preserved on Sina (on Twitter, those attention hoggers just keep rewriting posts to include their names).
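In the crawler, this amounts to reading two more keys off each decoded status. A quick sketch (the field names are Sina’s; the persistence call is a hypothetical stand-in for our own code):

```python
# Sketch: pulling the new V2 counters off a decoded status dict.
# The input mirrors what the V2 API returns; db.insert() is a
# hypothetical stand-in for our actual persistence layer.
def store_status(status):
    row = {
        "id": status["id"],
        "text": status["text"],
        "reposts_count": status.get("reposts_count", 0),
        "comments_count": status.get("comments_count", 0),
        "mlevel": status.get("mlevel"),  # probably "maturity level"
    }
    # db.insert("statuses", row)
    return row
```

The user entity, for its part, also gets a batch of new attributes: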

  • allow_all_comment
  • avatar_large
  • verified_reason
  • bi_followers_count
  • verified_type
  • lang

It should be noted that the last two, verified_type and lang, were not yet documented; I saw them just this afternoon (and promptly reflected them in our scripts). They are self-explanatory. Unlike on Twitter, Sina verification comes in several levels. My Weibo account is “verified”, but just because I was verified to be a JMSC employee (not because I am famous, bah). Corporate accounts get a different kind of verification, and there are probably others that I’m not certain about (power users?). lang is very interesting. We have mobile clients in English; we have Web interfaces in Traditional (for Taiwan) and Simplified Chinese (for the mainland). So, is Sina really preparing international versions?
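Since attributes keep appearing ahead of the documentation, our scripts read them defensively. A sketch of the idea (the wrapper and defaults are ours; the field names are Sina’s):

```python
# Sketch: defensive parsing of the new V2 user fields. Undocumented
# attributes (verified_type, lang) are read with .get() so the crawler
# keeps working on accounts where Sina has not populated them yet.
def parse_user(user):
    return {
        "id": user["id"],
        "allow_all_comment": user.get("allow_all_comment"),
        "avatar_large": user.get("avatar_large"),
        "verified_reason": user.get("verified_reason", ""),
        "bi_followers_count": user.get("bi_followers_count", 0),
        "verified_type": user.get("verified_type"),  # level of verification
        "lang": user.get("lang"),  # international versions coming?
    }
```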

***

After several e-mails from readers, we do acknowledge that we also faced some auth problems, but we were lucky to have started the project early, such that we don’t face some of the other problems (such as the need to specify in the developer app whether we are foreigners). I don’t know whether that has anything to do with the rate limits you end up getting.

We’ve also had a few problems with the friendship (friends/followers) functions in V1, and those are still there in V2. Namely, the calls won’t work with just OAuth: you also have to send cookies (thus, a Web browser accessing the API URL while signed in to your Sina account). If you see some inconsistencies, feel free to e-mail us.
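For the record, here is roughly what the workaround looks like with the requests library. This is a hedged sketch: the endpoint and parameters follow Sina’s V2 documentation as we read it, and the cookie name and value are illustrative placeholders for whatever your signed-in browser session holds.

```python
# Sketch of the OAuth-plus-cookies workaround for the friendship calls.
# ACCESS_TOKEN comes from the usual OAuth flow; the cookie name/value
# are illustrative -- copy the real ones from a browser signed in to
# your Sina account.
import requests

ACCESS_TOKEN = "YOUR_OAUTH_TOKEN"
SESSION_COOKIES = {"session_cookie_name": "value-from-signed-in-browser"}

resp = requests.get(
    "https://api.weibo.com/2/friendships/followers.json",
    params={"uid": "1234567890", "access_token": ACCESS_TOKEN},
    cookies=SESSION_COOKIES,
)
followers = resp.json().get("users", [])
print(len(followers))
```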

We’re submitting our social media project’s final report this week. So, expect us to release the tools we developed into the wild in the coming weeks (not months, I hope). Some of them are already up on our GitHub.