Indexing a table containing Chinese text with tsearch2 and bamboo (can take forever)

[Screenshot: terminal session at csam@lantau: ~/data/sinaweibo/mostretweeted]

Phew, it’s been more than three months since the last entry. We are still continuing our data collection of Sina Weibo, and it has become critically important to index the data.

What’s indexing? Imagine individual weibo posts as files. Without indexing, it’s like having your files thrown into a mound of other files, in no particular order. That’s a bit daunting when you have over 100 million entries stored. An index basically orders the weibos using a particular column as the guide. We chose a column that we frequently use for searches, such as the creation date of a weibo (a microblog post) in our example. The database creates an “index” that uses a particular data structure, such as a binary tree or a hash table, to speed up search and retrieval of entries.
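For the curious, creating such an index in PostgreSQL is a one-liner. This is only a sketch: the table name weibo is hypothetical, but created_at is the kind of column we index on.

    -- B-tree index on the creation date (the table name "weibo" is hypothetical)
    CREATE INDEX weibo_created_at_idx ON weibo (created_at);

    -- A date-range query can then use the index instead of scanning the whole table
    SELECT * FROM weibo
    WHERE created_at BETWEEN '2011-01-01' AND '2011-01-31';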

An index typically takes up a lot of physical storage space. For instance, the data of these 100 million posts takes up 15 GB, but the indexes (on the id, created_at, retweeted_status and user_id fields) take about 31 GB altogether.

However, we may also be interested in searching the weibos by keyword. Ordering weibos by their text field alone would not make such searches efficient. If you search for a keyword like 平安 (peace), you basically have to go through all 100 million entries’ text fields to find occurrences of these characters. A regular binary tree or hash index can only help you find entries containing “平安” if the text starts with those characters.
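To make that limitation concrete, here is a rough sketch (again with a hypothetical weibo table; note that a plain B-tree index only supports LIKE prefix matches under a C locale or with a text_pattern_ops index):

    -- Prefix search: a B-tree index on the text column can help
    -- (assuming text_pattern_ops or a C locale)
    SELECT id FROM weibo WHERE text LIKE '平安%';

    -- Substring search: no B-tree or hash index helps; every row must be scanned
    SELECT id FROM weibo WHERE text LIKE '%平安%';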

That’s where indexing techniques specifically designed for full-text search come into play. One of them is the inverted index, which is available with Tsearch2. The strategy is to go through all your entries, tokenize the text field you want to search, and store each token along with identifiers for the rows where it appears. For the text “我喜歡平安” (I like peace), you could obtain three tokens: “我” (I), “喜歡” (like) and “平安” (peace).
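In PostgreSQL terms, the tokens go into a tsvector column and the inverted index is a GIN (or GiST) index built on it. The sketch below is only indicative: the weibo table name and the 'chinese' text search configuration (standing in for whatever configuration is hooked up to the Bamboo segmenter) are assumptions, not our exact setup.

    -- Add a column holding the tokenized text
    ALTER TABLE weibo ADD COLUMN text_tsv tsvector;

    -- Tokenize every existing row ('chinese' is a placeholder configuration name)
    UPDATE weibo SET text_tsv = to_tsvector('chinese', text);

    -- Build the inverted index on the tokens
    CREATE INDEX weibo_text_tsv_idx ON weibo USING gin (text_tsv);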

Now, “我” is a very common word in Chinese, and is considered a stop word in natural language processing parlance (see Chinese stop words). This means it is so common that it is not very useful to keep track of its occurrence (it wouldn’t be efficient either, because a majority of posts would contain it).
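You can see stop word removal at work with PostgreSQL’s built-in English configuration (just an illustration; which Chinese words get dropped depends on the stop word list you configure):

    -- 'I' is a stop word in the english configuration and gets dropped;
    -- only 'like' and 'peace' survive (in stemmed form, with their positions)
    SELECT to_tsvector('english', 'I like peace');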

To repeat, the index is basically a list of keywords (tokens) on one side and, for each keyword, a list of identifiers of the rows that contain it. So, effectively, the next time you search for posts containing “平安”, instead of going through the text fields of all 100 million posts, you look up “平安” in the index and get the list of posts containing it. The keyword search will be much faster, but it also means updating the index every time you insert or update an entry. There’s a give and take.
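A keyword query against such an index looks like this (same assumptions as above: hypothetical weibo table and placeholder 'chinese' configuration):

    -- The @@ operator matches the query against the tsvector column,
    -- so only the GIN index is consulted, not 100 million text fields
    SELECT id, text FROM weibo
    WHERE text_tsv @@ to_tsquery('chinese', '平安');

The flip side is that text_tsv has to be recomputed whenever a row is inserted or updated, typically via a trigger, which is exactly the maintenance cost mentioned above.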

That said, I have been running the backlog of rows to index for almost two weeks now! That’s 24/7 for the past 14 days. The segmenting of Chinese sentences (using Bamboo) is CPU-intensive, but having to read and write through all 100 million entries and create the index is probably the bottleneck. It may never finish (or not before another couple of weeks), so I will probably quit the task and archive the old posts we’re not using any more. I may need to do an estimate of how long indexing the entire table would take, as there is no simple way to gauge the progress.
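One rough way to make that estimate would be to time the tokenization of a small sample and extrapolate (hypothetical weibo table and placeholder 'chinese' configuration as before; this ignores the cost of the index build itself):

    -- In psql, time how long it takes to tokenize a 10,000-row sample
    \timing on
    SELECT count(to_tsvector('chinese', text))
    FROM (SELECT text FROM weibo LIMIT 10000) AS sample;
    -- If the sample takes t seconds, the full 100 million rows would take
    -- roughly (100,000,000 / 10,000) * t seconds, plus index build overhead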

3 responses to “Indexing a table containing Chinese text with tsearch2 and bamboo (can take forever)”

  1. Han-Teng Liao says:

    Hi, this is Han-Teng Liao from the Oxford Internet Institute. I met your supervisor Dr. Fu in DC this year.

    I am not sure if the following tool can help. It is a tool developed by a programmer now behind Plurk:

    “Loso is mainly a Chinese word segmentation system I wrote for Plurk; when I have some time later, I will write an article to introduce it.”
    Project URL: https://github.com/victorlin/loso

    I am not so sure if it is more efficient than Bamboo, though.

    If you are interested in analyzing the geographic and linguistic scripts used in Weibo’s external links, we can have a chat about it in the near future.

  2. Cedric Sam says:

    Hi Han-Teng. Thanks, looks interesting. Hmm, I wonder if I can find more about his segmentation method? Let me write you an e-mail.

  3. Cedric Sam says:

    (Btw, interesting blog that you have there! http://people.oii.ox.ac.uk/hanteng/)
