Has Ai Weiwei been a popular topic on Sina Weibo?

According to this pic, it would seem so. But upon further inspection, only the first nine posts on this mosaic were made in May or after. Most came from early April when Ai was arrested at Beijing airport.

This is a search tool that we use to go through our weibo database, but only using posts with pictures attached to them. We had performance issues at adding new entries for two weeks due to the ongoing indexing. Nonetheless, the takehome message is that there were lots of posts in the days following the arrest, but not much since then. Since the archive is not up to date, there’s no indication whatsoever about the popularity of the topic now. (At least anecdotically, people have been talking about AWW on Weibo.)

(Note: I sampled a few AWW posts from the archive and when their actual page on weibo.com, you can’t find because they were presumably deleted.)


Indexing a table containing Chinese text with tsearch2 and bamboo (can take forever)

Screenshot-csam@lantau: ~-data-sinaweibo-mostretweeted

Phew, it’s been more than three months since the last entry. We are still continuing our data collection of Sina Weibo and it has become critically important to index.

What’s indexing? Imagine individual weibo posts as files. Without indexing, it’s like having your files just thrown into a mound of other files, in no particular order. That’s a bit daunting if you now have stored over 100 million entries. An index basically orders the weibos using a particular column as the guide. We chose a column that we frequently use for searches, such as the creation date of a weibo (a microblog post) in our example. The database creates an “index” that uses a particular data structure to speed up search and retrieval of entries, such as the binary tree or a hash function.

An index would typically take a lot of physical storage space. For instance, the data of these 100 million posts take up 15 GB, but the indexes (on the id, created_at, retweeted_status and user_id fields) take about 31GB all together.

However, we may also be interested in searching the weibos by keyword. Ordering weibos by their text field alone would not produce very efficient results. If you search for a keyword like 平安 (peace), you would basically have to go through all 100 million entries’ text fields to find the occurrence of these characters. A regular binary tree or hash index could only help you find entries containing “平安” if it started with these characters.

That’s where indexing techniques specially for full text search come into play. One of them is the inverted index, which is available with Tsearch2. The strategy is to go through all your entries, tokenize the text field you want to search, and store the token with identifiers for the rows where these tokens appear. For the text “我喜歡平安” (I like peace), you could obtain three tokens “我” (I), “喜歡” (like) and “平安” (peace).

Now, “我” is a very common word in Chinese, and is considered a stop word in natural language processing parlance (see Chinese stop words). This means it is so common that it is not very useful to keep track of its occurrence (it wouldn’t be efficient either, because a majority of posts would contain it).

To repeat, the index is basically a list of keywords (tokens) on one side, and some list of identifiers of rows that contain this keyword. So, effectively, when you search for posts containing “平安” the next time, instead of going through all the rows and all the text entries of the 100 million posts, you instead search in the index for “平安” and get the list of posts containing it. The keyword search will be much faster, but it also means updating the index every time you insert or update an entry. There’s a give and take.

That said, I have now been running the backlog of rows to index for almost two weeks now! That’s 24/7 for the past 14 days. The segmenting of Chinese sentences (using Bamboo) is CPU-intensive, but having to read and write through all 100 million entries and create the index is probably the bottleneck. It may never finish (or not before another couple of weeks), so I will probably quit the task and archive the old posts we’re not using any more. I may need to do an estimate of how long indexing the entire table would take, as there is no way to simply estimate the progress.