Saturday, October 10, 2015

Extract Historic Tweets

Extracting historic Twitter data is always problematic, because the Twitter REST API can retrieve only the 3,200 most recent tweets of a user. So if you want to find what YouTube tweeted between 2013 and 2014, the task is almost impossible using the REST or Streaming API alone.

However, Twitter Advanced Search returns historic tweets that match a user-defined query, which can include the time a tweet was posted. For example, this link from Twitter Advanced Search gives you the full list of tweets that YouTube posted between 2013 and 2014.

Since the parameters in the request URL of the Twitter Advanced Search can be customized, it is possible to collect historic tweets by sending requests to the Twitter Advanced Search and extracting the tweet information from the returned web pages.
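As a rough illustration of customizing the request URL, here is a minimal sketch that builds a search URL for one account and a date range. The base URL, the `q` parameter, the helper name `build_search_url`, and the exact start and end dates are assumptions made for this example, so check them against the URL your browser shows when you run an Advanced Search.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed address of the search results page; verify it in your browser.
SEARCH_URL = "https://twitter.com/search"


def build_search_url(screen_name, since, until):
    """Build a search URL for one account and a date range (hypothetical helper)."""
    # from:, since: and until: are Twitter search operators.
    query = f"from:{screen_name} since:{since.isoformat()} until:{until.isoformat()}"
    # urlencode percent-escapes the colons and turns spaces into '+'.
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


# Example dates chosen to cover 2013 and 2014.
url = build_search_url("YouTube", date(2013, 1, 1), date(2014, 12, 31))
print(url)
```

Keeping the account name and the dates as function arguments makes it easy to loop over several users or to split a long period into smaller windows.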

Here is a simple Python script that can extract the historic tweet IDs of specific Twitter users. The logic is straightforward:

  1. Customize the URL to request the historic tweets of specific Twitter users during a defined time period;
  2. Get the response page from the Twitter Advanced Search;
  3. If the results fit on a single page, use the BeautifulSoup package to extract the tweet IDs from the page;
  4. If the results run longer than one page, use Selenium to scroll down the page until all the tweet IDs are loaded;
  5. Use the Twitter Search API to extract the tweet contents from the collected tweet IDs.
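The ID extraction in step 3 could look something like the sketch below. It is not the script itself: to stay free of extra packages it uses the standard library's `html.parser` instead of BeautifulSoup, and it assumes that each tweet in the returned page carries its ID in a `data-tweet-id` attribute. The sample HTML is invented for the example, so inspect the real page source and adjust the attribute name if it differs.

```python
from html.parser import HTMLParser


class TweetIdParser(HTMLParser):
    """Collect tweet IDs from a results page.

    Assumes each tweet element has a data-tweet-id attribute.
    """

    def __init__(self):
        super().__init__()
        self.tweet_ids = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Keep each ID once, in the order it appears on the page.
            if name == "data-tweet-id" and value and value not in self._seen:
                self._seen.add(value)
                self.tweet_ids.append(value)


def extract_tweet_ids(page_html):
    """Return the tweet IDs found in the given HTML (hypothetical helper)."""
    parser = TweetIdParser()
    parser.feed(page_html)
    return parser.tweet_ids


# Made-up markup standing in for a returned results page.
sample_html = """
<ol id="stream-items-id">
  <li><div class="tweet" data-tweet-id="1001">first</div></li>
  <li><div class="tweet" data-tweet-id="1002">second</div></li>
  <li><div class="tweet" data-tweet-id="1001">duplicate</div></li>
</ol>
"""
ids = extract_tweet_ids(sample_html)
print(ids)
```

The same function should work for step 4 as well: once Selenium has scrolled to the bottom of the results, pass the driver's `page_source` to `extract_tweet_ids`.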