Importing My Twitter Data Into MongoDB

Twitter.com doesn’t seem to offer a way to access your oldest tweets through its web client or apps. Because I wanted to go through my oldest posts, starting from when I joined Twitter in 2016, I decided to take matters into my own hands by exporting my Twitter data and importing it into a MongoDB instance. Some post-processing was needed to make the data more convenient to work with in Mongo, so I used an aggregation pipeline for that.

Step 1: Exporting my data from Twitter.

In Twitter’s web client, open Settings & Privacy from your profile menu. Then, under Settings & Data (a password may be required), scroll to the bottom of the page to the section Download your Twitter data. From here you can request your entire Twitter history, including tweets, favorites, followers, etc. When choosing what data to download, make sure you change the Twitter data format from HTML to JSON to make it easier to work with. Preparing the archive takes some time, so you have to wait for an email that links back to the Download or view data page. I got the link from Twitter in my inbox in about 30 minutes. The total size of my archive was 1.6 GB, covering about 3 years’ worth of activity and over 12,000 tweets and replies.

Step 2: Download and Extract my Twitter data

I downloaded the data archive and extracted it locally. The contents are a series of JavaScript (.js) files containing JSON data for different aspects of my account and its metadata. It is all interesting, but the file I cared about was tweet.js, which contains all posts and replies from my account in one long list. Uncompressed, this is a 16 MB text file. To import this data into MongoDB, I had to do a little cleanup. I copied the file to a new file, tweet_copy.json. I used the .json extension because one MongoDB application was very particular about it; otherwise the name doesn’t really matter.

I also edited the copy and removed the assignment at the start of the first line, which was: window.YTD.tweet.part0 = [ {...

So in tweet_copy.json, the first line just started with [ {...

Step 3: Install MongoDB

I installed MongoDB Server Community Edition, a free NoSQL database that easily supports JSON-structured data. My goal is to import my Twitter posts here so I can sort and find older posts more easily than the current Twitter interfaces allow.

I installed MongoDB 4.0.7 from https://www.mongodb.com/download-center/community and had it up and running in minutes.

Step 4: Import twitter data to Mongo

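To import the Twitter JSON file into a new database “twitter_data” and a collection named “twitter”, you can use the mongoimport executable from a Terminal session. The --jsonArray flag tells it that the file is a single JSON array of documents rather than one document per line.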

./mongoimport --db twitter_data --collection twitter --file "/Users/brianl/Downloads/twitter-2019-03-23-46e7/tweet_copy.json" --jsonArray

To easily see the data in Mongo, I installed the standard MongoDB GUI client, Compass, which had its pros and cons. But essentially I could connect to my local database and browse my posts now!

Step 5: Shape the data for better filtering and sorting

An example of the default format of the data from the Twitter file would be something like this:

{
  "retweeted" : false,
  "source" : "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad</a>",
  "entities" : {
    "hashtags" : [ ],
    "symbols" : [ ],
    "user_mentions" : [ {
      "name" : "NASA's Kennedy Space Center",
      "screen_name" : "NASAKennedy",
      "indices" : [ "77", "89" ],
      "id_str" : "16580226",
      "id" : "16580226"
    } ],
    "urls" : [ ]
  },
  "display_text_range" : [ "0", "89" ],
  "favorite_count" : "0",
  "in_reply_to_status_id_str" : "938953481653772287",
  "id_str" : "938979392532942847",
  "in_reply_to_user_id" : "16580226",
  "truncated" : false,
  "retweet_count" : "0",
  "id" : "938979392532942847",
  "in_reply_to_status_id" : "938953481653772287",
  "created_at" : "Fri Dec 09 03:51:47 +0000 2017",
  "favorited" : false,
  "full_text" : "Its amazing to see up close at @NASAKennedy",
  "lang" : "en",
  "in_reply_to_screen_name" : "NASAKennedy",
  "in_reply_to_user_id_str" : "16580226"
}

One of the problems I faced was that all of the numeric and date type attributes here, such as “id” or “created_at” were in string formats. This made trying to sort on them clunky at best in Mongo. So I decided to use Mongo’s aggregation feature with a pipeline of transformations that would essentially copy the data into a more usable collection format in the twitter_data database.
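To see why string ids sort badly, here is a small plain Node.js illustration (no MongoDB involved). The two short ids are made-up sample values; only the long one comes from the example above. Strings are compared character by character, so a shorter, smaller id can land after a longer, larger one. BigInt is used for the numeric comparison because tweet ids are too large for an ordinary JavaScript number to represent exactly.

```javascript
// Ids stored as strings, the way the Twitter export stores them.
// "99999" and "1000000" are hypothetical values chosen to show the problem.
const ids = ["938979392532942847", "99999", "1000000"];

// Default string sort compares character by character, so "99999"
// ends up last even though it is the smallest number.
const sortedAsStrings = [...ids].sort();

// Numeric sort: convert to BigInt first, because these ids exceed
// Number.MAX_SAFE_INTEGER and would lose precision as plain numbers.
const sortedAsNumbers = [...ids].sort((a, b) => {
  const x = BigInt(a);
  const y = BigInt(b);
  return x < y ? -1 : x > y ? 1 : 0;
});

console.log(sortedAsStrings); // [ '1000000', '938979392532942847', '99999' ]
console.log(sortedAsNumbers); // [ '99999', '1000000', '938979392532942847' ]
```

The same applies to created_at: as a string it would sort by the day name ("Fri", "Mon", ...) rather than chronologically.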

In my case the pipeline was three stages:

  • $project to copy every individual record, but transform individual attributes to a new format/data type
  • $sort on the numeric tweet id
  • $out which creates a physical copy of the transformed records in a new collection (so I leave the original data intact)

To do this I used the Mongo Shell:

> use twitter_data
> db.twitter.aggregate([ 
{$project: {
    id:{$toLong: "$id"},
    created_at: {$dateFromString:{dateString: "$created_at"}},
    display_text_range: "$display_text_range",
    entities: "$entities",
    favorite_count:{$toInt: "$favorite_count"},
    favorited: "$favorited",
    full_text: "$full_text",
    id_str:"$id_str",
    in_reply_to_screen_name:"$in_reply_to_screen_name",
    in_reply_to_status_id:"$in_reply_to_status_id",
    in_reply_to_user_id:"$in_reply_to_user_id",
    lang: "$lang",
    retweet_count: {$toInt: "$retweet_count"},
    retweeted: "$retweeted",
    source: "$source",
    truncated: "$truncated",
    extended_entities: "$extended_entities"
}},
{$sort:{id: 1}},
{$out: "twitterFixed" }
],
{ allowDiskUse: true })
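To reason about what the conversions should produce for a single record, the sketch below mirrors them in plain Node.js. It is an illustration only, not part of the pipeline: convertTweet and parseTwitterDate are hypothetical helper names, the date parser assumes every created_at value uses the +0000 offset seen in the sample, and unlike $project it copies all remaining fields as they are.

```javascript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Parse a created_at string such as "Fri Dec 09 03:51:47 +0000 2017".
// Assumption: the offset is always +0000, as in the sample record.
function parseTwitterDate(s) {
  const m = /^\w{3} (\w{3}) (\d{2}) (\d{2}):(\d{2}):(\d{2}) \+0000 (\d{4})$/.exec(s);
  if (!m) throw new Error(`Unexpected created_at format: ${s}`);
  const month = MONTHS.indexOf(m[1]);
  if (month < 0) throw new Error(`Unknown month: ${m[1]}`);
  return new Date(Date.UTC(Number(m[6]), month, Number(m[2]),
                           Number(m[3]), Number(m[4]), Number(m[5])));
}

// Convert the string fields of one raw tweet to typed values,
// roughly what $toLong, $dateFromString and $toInt do in the pipeline.
function convertTweet(raw) {
  return {
    ...raw, // keep every other field unchanged
    id: BigInt(raw.id), // 64-bit id, too large for a plain Number
    created_at: parseTwitterDate(raw.created_at),
    favorite_count: parseInt(raw.favorite_count, 10),
    retweet_count: parseInt(raw.retweet_count, 10),
  };
}

const sample = convertTweet({
  id: "938979392532942847",
  created_at: "Fri Dec 09 03:51:47 +0000 2017",
  favorite_count: "0",
  retweet_count: "0",
  lang: "en",
});
console.log(sample.created_at.toISOString()); // 2017-12-09T03:51:47.000Z
```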

Step 6: Querying the data in Mongo with Compass

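Using Compass, you can easily run a filter or sort query, so I could sort by id or filter on other fields. Because the pipeline applied $sort: {id: 1} before writing the new collection, the documents were written oldest first, so I didn’t need to do much to look at the data. An explicit sort on id in Compass is still the safest way to guarantee that order.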

Future

I plan to build a very simple web UI on top of this database to let me browse all of my old tweets. I’ll probably do that with Node.js. TBD.