I was looking for a practical project to help me learn MongoDB, and realised that the ideal data source already existed in my Python ReTweeter project (FurzedownTweets). The basic Tweet object exposed as JSON by the Twitter API is a great example of the sort of hierarchical document that MongoDB is designed to handle.
I looked at installing MongoDB locally. However, as my ReTweeter bot is now running in the cloud, it would make sense to keep the data there too.
I discovered MongoDB Atlas - database as a service, hosted and managed by MongoDB in the cloud - with a free tier providing 512MB of storage over a 3-node replica set: ideal for a low-data-volume educational project. It also has a neat desktop GUI, MongoDB Compass, which makes connecting and reviewing data super easy, as well as showing stats relating to performance. However, it’s also possible to connect via the mongo command line.
Extracting Twitter Data to JSON
I did not want the availability of the database to compromise the functionality of my TwitterBot, so opted for a batch solution. For each status that is retweeted, I write the Tweet JSON out to a text file. The files are imported and rolled each night. This decouples the TwitterBot functionality from the database availability.
Python Script Modifications and Issues
After each retweet, I call a new saveTweetJsonToFile function. This is wrapped in a try...except block.
I get the latest daily file name, based on today’s date. The file is created if it does not already exist.
The Twitter API library I am using (Tweepy) exposes a Status object which represents all the data relating to a given Tweet. It is not JSON serializable, but it does have a _json property holding the raw API response, which can be serialized using the Python json.dumps function.
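A minimal sketch of what this might look like - the filename pattern, the one-document-per-line format and the error handling are my own assumptions here:

```python
import json
from datetime import date

def saveTweetJsonToFile(status):
    """Append the raw JSON for one retweeted status to today's file."""
    # Daily file name based on today's date (naming pattern is illustrative)
    filename = "tweets_{}.json".format(date.today().isoformat())
    try:
        # Tweepy's Status object is not JSON serializable itself, but its
        # _json attribute holds the raw dict returned by the Twitter API
        line = json.dumps(status._json)
        # Open in append mode - the file is created if it does not exist
        with open(filename, "a") as f:
            f.write(line + "\n")
    except Exception as e:
        # A problem writing the file should never stop the TwitterBot itself
        print("Failed to save tweet JSON: {}".format(e))
```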
MongoDB assigns an arbitrary ObjectId-based _id value to each record inserted if it does not already contain an _id field. I wanted to override this default behaviour and use the Twitter Status ID as the _id value for each document inserted, as this is guaranteed to be unique and also ties each document back to the original Twitter status. I solved this by creating an addDatabaseId method. This takes the JSON for one tweet, fetches the Twitter Status ID (stored in id_str) and assigns it to a new _id field. (Note: I use the string representation of the status ID, id_str, rather than the numeric one, id - these status IDs are very large, and the Twitter API documentation recommends using the string version to ensure safe handling.)
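The method itself is essentially a one-liner; a sketch based on the description above (the exact signature is an assumption):

```python
def addDatabaseId(tweet_json):
    """Use the Twitter status ID as the MongoDB _id for this document.

    id_str is used rather than the numeric id, as recommended by the
    Twitter API docs for safe handling of very large IDs.
    """
    tweet_json["_id"] = tweet_json["id_str"]
    return tweet_json
```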
The end result is a daily file containing the full JSON for each tweet status that has been retweeted by my TwitterBot that day.
Importing JSON into MongoDB
I now have a text file for each day, containing all the data relating to statuses retweeted by my TwitterBot. The next step is to import this data into my MongoDB. For this, I used the pymongo Python/MongoDB library.
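Connecting to the Atlas cluster with pymongo looks something like this - the connection string, database and collection names below are placeholders rather than my real values:

```python
from pymongo import MongoClient

# Placeholder Atlas connection string - the real one comes from the
# "Connect" dialog in the Atlas UI
client = MongoClient("mongodb+srv://<user>:<password>@cluster0.mongodb.net/")
db = client["furzedowntweets"]   # database name is illustrative
tweets = db["tweets"]            # collection name is illustrative
```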
All batch processes should be re-runnable. My loadTweets.py script calculates yesterday’s start and end time, and deletes any data which may already exist between those times. This allows me to re-run the script safely if there are any problems.
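Roughly, the delete step looks like this (reusing the tweets collection handle from the connection snippet above, and assuming created_at has already been converted to a datetime by a previous run - see below):

```python
from datetime import datetime, time, timedelta

# Yesterday's boundaries: midnight to midnight (UTC for simplicity)
start = datetime.combine(datetime.utcnow().date() - timedelta(days=1), time.min)
end = start + timedelta(days=1)

# Delete anything already loaded for yesterday so the script can be re-run safely
tweets.delete_many({"created_at": {"$gte": start, "$lt": end}})
```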
I then open the JSON file and extract the data into an array.
I iterate over the array and use the pymongo insert_one method to insert the JSON for each array element into MongoDB.
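Again reusing the collection handle from above, and assuming one JSON document per line in the daily file, the load loop is roughly:

```python
import json
from datetime import date, timedelta

# Yesterday's file, matching the naming used when the tweets were saved
filename = "tweets_{}.json".format((date.today() - timedelta(days=1)).isoformat())

# Read each line of the file back into a Python dict
with open(filename) as f:
    tweet_docs = [json.loads(line) for line in f if line.strip()]

# Insert one document per retweeted status
for doc in tweet_docs:
    tweets.insert_one(doc)
```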
I was planning to do some analysis on my Twitter data by date and time. Unfortunately, the date/time fields in the JSON are stored as strings when they are imported, and I did not want to have to perform type conversions on these every time I ran a query in MongoDB. To solve this, after inserting my tweet data, I run an update function which finds any created_at fields in the collection which have a string value and converts them to datetime.
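One way to do that conversion from Python is sketched below; it only handles the top-level created_at field, and the update logic here is an assumption rather than my exact code:

```python
from datetime import datetime

# Twitter's created_at format, e.g. "Wed Oct 10 20:19:24 +0000 2018"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Find documents whose created_at is still a string and convert it in place
for doc in tweets.find({"created_at": {"$type": "string"}}):
    parsed = datetime.strptime(doc["created_at"], TWITTER_DATE_FORMAT)
    tweets.update_one({"_id": doc["_id"]}, {"$set": {"created_at": parsed}})
```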
Conclusion
I now have a daily batch process (run via a cron job) which processes my daily JSON file and uploads it to my MongoDB database. This has given me, though obviously not big data, a reasonable base from which to experiment with MongoDB querying, aggregation and map/reduce.
It’s also expanded my Python knowledge in having to overcome the various issues described above - and, as an added bonus, I’ve learnt the basics of Markdown. The next step is to create some web pages which will show analyses of the data being recorded by FurzedownTweets - who knows, it may be useful to someone!