Sina Weibo deleted posts archive

Sina Weibo has a pretty good API, and since we already download a lot of data every day, it makes sense to keep an archive of deleted posts.

The strategy is straightforward and adds only a negligible number of extra API calls:
– Call statuses/user_timeline for each user in your list (we have 2,500 in a sub-list).
– Extract the IDs of all 200 posts in the response and save them to a text file, one ID per line. They are already ordered chronologically.
– You should have a list of IDs saved from the previous run. Use diff to compare the two files.
– Loop through the diff output and collect every ID that appears in the previous version but not in the new one.
– Those IDs are the deleted posts (or, if the user has posted since the last run, older posts that have dropped out of the 200-post window).
– Mark them as deleted, send your alert, etc. (We also hit the API again on statuses/show to double-check that the post was really deleted.)
– Overwrite the old ID list with the new one.
– Repeat whenever you can fetch a new version of the timeline (Sina may rate-limit you if you do it too often).
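
The comparison step above can be sketched in a few lines of Python. This is only an illustration, not the code we run: it uses a set difference in place of the diff command, and the function names (read_ids, find_deleted_ids, update_snapshot) are made up for this example. It also assumes that post IDs are numeric and grow over time, so that an old ID below the oldest ID in the new list can be treated as having scrolled out of the window rather than deleted. Fetching the timeline and confirming each candidate with statuses/show are left out.

```python
from pathlib import Path


def read_ids(path):
    """Read one post ID per line; a missing file means no previous snapshot."""
    p = Path(path)
    if not p.exists():
        return []
    return [line.strip() for line in p.read_text().splitlines() if line.strip()]


def find_deleted_ids(old_ids, new_ids):
    """Return IDs present in the old snapshot but absent from the new one.

    Assumption (not from the API docs): IDs are numeric and increase over
    time, so an old ID below the oldest ID in the new snapshot may simply
    have dropped out of the fetched window and is not reported.
    """
    if not new_ids:
        # Empty timeline: every old ID is a candidate, to be confirmed later.
        return list(old_ids)
    new_set = set(new_ids)
    oldest_new = min(int(i) for i in new_ids)
    return [i for i in old_ids if i not in new_set and int(i) > oldest_new]


def update_snapshot(path, new_ids):
    """Compare against the stored snapshot, then overwrite it with new_ids."""
    deleted = find_deleted_ids(read_ids(path), new_ids)
    Path(path).write_text("\n".join(new_ids) + "\n")
    return deleted
```

For example, with a previous list of 105, 104, 103, 102, 101 and a new list of 106, 105, 103, 102, find_deleted_ids returns only 104: ID 101 is also missing, but it is older than everything in the new list, so it is assumed to have fallen out of the window. Anything returned here is still only a candidate until statuses/show confirms it.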

