Friday, June 9, 2017

Five Thirty Eight's Pile of Data


UPDATE: Even more of their data is now available, and it is easier to get at on (you guessed it) their data site: https://data.fivethirtyeight.com/

I have always found it tough to find interesting data sets, especially ones that are not contrived. At Five Thirty Eight they are constantly looking at the world through data. Their primary posts tend to be about politics or sports, but they often have posts on pop culture and other topics. For example, they recently had a post titled "Why Classic Rock Isn't What It Used To Be". In that post they analyzed over 37,000 plays of classic rock songs spanning decades. And not only have they done the work, they've made all of the raw data available: all 37,673 pieces in a CSV file.

Downloading the Data

They have a GitHub site where they make the raw data available for many of their stories. They publish a lot of data-related stories, and although most of them are not on this site, almost 100 are. For example, the article about how deadly it is to be an Avenger doesn't have any graphs, but it comes with a bunch of categorical data that you could turn into a histogram or bar graph.
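As a rough sketch of what "doing something with the categorical data" could look like, the Python below counts how often each category appears in one column of a CSV and prints a simple text bar graph. The sample rows and the column names (`name`, `status`) are made up for illustration and are not taken from any Five Thirty Eight file.

```python
import csv
import io
from collections import Counter

# Made-up sample data for illustration only (not from a real data set).
SAMPLE = """name,status
A,alive
B,deceased
C,alive
D,alive
"""


def category_counts(csv_text, column):
    """Count how many rows fall into each category of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


def text_histogram(counts):
    """Build one line per category, most common first, with a bar of # marks."""
    lines = []
    for label, n in counts.most_common():
        lines.append(f"{label:<10} {'#' * n} ({n})")
    return "\n".join(lines)


counts = category_counts(SAMPLE, "status")
print(text_histogram(counts))
```

To try this on a real file, you would swap `SAMPLE` for the text of a downloaded CSV and pass in one of that file's actual column names.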

Or if you are a Bob Ross fan (real or ironic), you can get the data they analyzed on the paintings he created for his show. The article covers the analysis, but on the GitHub site you get the raw data plus, as an added bonus for you code jockeys, the Python script that they used to create the data set. Most of the data sets include a link to the original article.

Note that when you see the CSV file listed, you can't just right-click and download the file. That will only save the web page that displays the file, not the data itself. To get the actual data, click the CSV link and then copy the data from the table that appears.
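If you would rather fetch the data with code than copy it from the table, here is a minimal sketch in Python. It assumes GitHub serves a file's raw contents from raw.githubusercontent.com using the same owner, repository, branch, and path as the file's page, and that the branch name contains no slashes. The function names `to_raw_url` and `fetch_rows` are made up for this example.

```python
import csv
import io
import urllib.request
from urllib.parse import urlparse


def to_raw_url(page_url):
    """Turn a GitHub file-page URL (.../blob/<branch>/<path>) into a raw-content URL."""
    parts = urlparse(page_url).path.strip("/").split("/")
    # Expected shape: owner / repo / "blob" / branch / path to the file
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub file page URL: {page_url}")
    owner, repo, _, branch, *path = parts
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, branch, *path])


def fetch_rows(page_url):
    """Download the CSV behind a GitHub file page and return its rows as dicts."""
    with urllib.request.urlopen(to_raw_url(page_url)) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))
```

You would call it with the address of the CSV's page, for example `fetch_rows("https://github.com/fivethirtyeight/data/blob/master/some-folder/some-file.csv")`, where `some-folder/some-file.csv` is a placeholder for whichever data set you pick.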

Some other interesting sets cover Fandango's movie ratings, the connections between the actors in the movie Love Actually, and the popularity of unisex names.

One small warning: this is raw data, and in a few cases really raw. For example, the data set about the number of times someone cursed or bled out in a Quentin Tarantino movie is very cool but totally inappropriate for a classroom (there are 1,895 pieces of data in this set).

Check them all out on the sites:
https://data.fivethirtyeight.com/
https://github.com/fivethirtyeight/data
