Friday, December 11, 2015

Reddit Discussions

It's no secret that I am a big fan of a site that does some great statistical analysis of sports, entertainment and politics. They also have some interactive data sections where they take a topic and let you get the data on it. Take, for example, this one on the information site Reddit. This is a pretty thriving community of Internet users who participate in discussion boards on a large range of topics (some inappropriate). That being said, they have scraped the site, counted the usage of certain keywords and matched them up against each other. Take, for example, the usage of Batman, Superman or Spiderman over the last 8 years or so (and 1.7 billion comments).

When you go to the tool, it will immediately choose a few keywords at random. There are many choices and you can click Shuffle to get a new set (BEWARE that some of the search terms are swears, so I wouldn't click that in class), but you can also just type in any keywords that you want to compare. This is similar to Google Trends but just for Reddit.

The Analysis

On any graph you can drag the sliders on the zoom bar to zoom into any place on the graph. You can also adjust the smoothing, which changes how many days each averaged point covers. The graphs are essentially broken line graphs, and you can get the data for any set by clicking on the Download the Data button. The values in that CSV file represent the percent of the total number of comments that the word or phrase accounts for in the given time period.
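If you want students to see what the smoothing slider is doing, here is a minimal Python sketch that averages each day with the few days before it. The column names (date, batman) and the sample values are made up for illustration, so check them against the CSV you actually download. The site may also smooth in a different way (for example, centering the window), so treat this as one possible version, not the tool's method.

```python
import csv
import io

# Hypothetical sample in the shape a downloaded CSV might have; the real
# column names and values may differ.
SAMPLE_CSV = """date,batman
2015-01-01,0.010
2015-01-02,0.030
2015-01-03,0.020
2015-01-04,0.060
2015-01-05,0.040
"""

def read_series(text, column):
    """Return (dates, values) for one keyword column of the CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    dates = [row["date"] for row in rows]
    values = [float(row[column]) for row in rows]
    return dates, values

def moving_average(values, window):
    """Average each value with up to (window - 1) values before it."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

dates, values = read_series(SAMPLE_CSV, "batman")
smoothed = moving_average(values, 3)  # 3-day smoothing
```

A bigger window gives a flatter line, which is the same trade-off students see when they drag the smoothing slider.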

Sample Questions

  • Identify any trends in the data.
  • Identify why there might be spikes in the data. That is, what was happening in the news at that time that might have caused people to use those words?
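For the spike question, students could let the computer flag the dates worth looking up. This is a rough Python sketch that marks any day whose value is more than double the average for the whole period. The dates, values and the "double the average" rule are all my own assumptions for illustration, so adjust the factor to suit the data you download.

```python
def find_spikes(dates, values, factor=2.0):
    """Return (date, value) pairs where the value is more than factor times the mean."""
    mean = sum(values) / len(values)
    return [(d, v) for d, v in zip(dates, values) if v > factor * mean]

# Made-up values standing in for one keyword's percent-of-comments column.
dates = ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05"]
values = [0.010, 0.012, 0.011, 0.050, 0.012]

spikes = find_spikes(dates, values)
```

Each flagged date is then a starting point for a news search on what was happening that day.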

Download the Data

CSVs can be downloaded from any data set.

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.
