The Grateful Dead in data: an exploratory analysis

Note to French-speaking readers

Thank you for your interest in my work. This article was originally written in English, and I have no plans to translate it for now. Note that the rest of my site is generally written in French.

Introduction

If I've ever talked to you for more than a minute, you probably know that I'm a huge fan of the godfather of all jam bands: the Grateful Dead.
The band is known for its extensive collection of live performances, having played over 2,300 concerts over a thirty-year career. Today, they remain one of the highest-grossing American touring acts. Since I am both a Grateful Dead fan (a Deadhead) and a data analyst, let's see what we can learn about the band through data analysis.

Gathering the data

There are plenty of ways to access datasets related to the Grateful Dead. One of the most complete sets of data about their career is probably the one available at setlist.net. However, they do not provide the dataset itself in the form of a .json or .csv file. The same goes for the deadbase setlist explorer. When I first started working on this project, I grabbed Andrew Blance's JerryPycia dataset in the form of a .csv file, which provides the venue, city, country, date, year, setlist and state of the 1602 Dead shows ever taped. However, this was limiting, as most of the data was qualitative.

Eventually, I thought I'd use the Spotify API. Spotify uses amazing technology to compute audio features and an in-depth audio analysis of the tracks hosted on its platform. It lets you read the calculated audio features of a track to learn about its danceability, energy, valence, and more. For more advanced use cases, it is possible to read in-depth analysis data about tracks, such as segments, tatums, bars, beats, pitches, etc. Even better: Spotify has a very developer-friendly API that one can use to integrate its services into apps and websites.

One issue, obviously, is that Spotify does not have the whole catalog of Grateful Dead live sets on its platform. Far from it: only a small number of them have been released as records through labels.

There are various ways in which Grateful Dead live albums are released:

In the form of sporadic, traditional releases
Through the Dick's Picks series, which started in 1993 and was named after Grateful Dead tape vault archivist Dick Latvala. Latvala selected shows with the band's approval and oversaw the production of the albums. After Latvala's death in 1999, David Lemieux became the Dead's tape archivist and took over responsibility for producing subsequent Dick's Picks releases, as well as his own Dave's Picks series.
Through the Road Trips series: The Road Trips series of albums is the successor to Dick's Picks. The series started after the Grateful Dead signed a ten-year contract with Rhino Records to release the band's archival material. The Road Trips releases are created using two-track concert recordings, but unlike Dick's Picks they each contain material from multiple concerts of a tour.
Through the Digital Download Series, which are concerts that used to be available through the Grateful Dead's online store, but which are now available on streaming platforms

Most tutorials on exploring the Spotify API use the very good spotipy package. However, I felt it was too much for my needs, as I really only wanted to query the audio-features endpoint of the Spotify API for each Grateful Dead song.

Getting a Spotify API token is quite straightforward, and there are tons of tutorials online on how to get one. I found it easier to just grab everything and clean my dataset later on. I could simply iterate over every Grateful Dead album in the Spotify catalog like this:

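import requests
import pandas as pd

# Assumes BASE_URL, headers (carrying the access token) and d (the artist's
# albums response) have already been defined
albums = []  # upper-cased names of the albums already processed
data = []    # one dict of audio features per track

for album in d['items']:
    album_name = album['name']

    # Drop suffixes such as "(Live)" and skip albums we have already seen
    trim_name = album_name.split('(')[0].strip()
    if trim_name.upper() in albums:
        continue
    albums.append(trim_name.upper())

    print(album_name)

    # Pull the track list of the album
    r = requests.get(BASE_URL + 'albums/' + album['id'] + '/tracks', headers=headers)
    tracks = r.json()['items']

    for track in tracks:
        # Pull the audio features of the track, then attach track and album info
        f = requests.get(BASE_URL + 'audio-features/' + track['id'], headers=headers)
        f = f.json()

        f.update({
            'track_name': track['name'],
            'album_name': album_name,
            'short_album_name': trim_name,
            'release_date': album['release_date'],
            'album_id': album['id']
        })
        data.append(f)

df = pd.DataFrame(data)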
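
The loop above relies on a few names that are defined elsewhere. As a rough sketch of how they might be set up, assuming Spotify's client credentials flow: the function names, the client_id / client_secret arguments and the artist_id argument are placeholders of mine, not code from the original notebook.

```python
import requests

BASE_URL = 'https://api.spotify.com/v1/'
AUTH_URL = 'https://accounts.spotify.com/api/token'


def get_access_token(client_id, client_secret):
    """Ask Spotify for a token using the client credentials flow."""
    r = requests.post(AUTH_URL, data={
        'grant_type': 'client_credentials',
        'client_id': client_id,          # placeholder: your own app's id
        'client_secret': client_secret,  # placeholder: your own app's secret
    })
    r.raise_for_status()
    return r.json()['access_token']


def build_headers(access_token):
    """Build the header sent along with every GET request to the API."""
    return {'Authorization': 'Bearer {}'.format(access_token)}


def get_artist_albums(artist_id, headers):
    """Collect every album of an artist, following the paging 'next' links."""
    url = BASE_URL + 'artists/' + artist_id + '/albums'
    params = {'include_groups': 'album', 'limit': 50}
    items = []
    while url:
        page = requests.get(url, headers=headers, params=params).json()
        items.extend(page['items'])
        # The 'next' URL already carries its own query string
        url, params = page.get('next'), None
    return items
```

With something like this in place, headers = build_headers(get_access_token(...)) and d = {'items': get_artist_albums(artist_id, headers)} would provide what the loop expects.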

To access these endpoints, we send a properly formed GET request to the API server, with an access_token in the header. For each track, the audio-features endpoint answers with a response in the following format:

{
    'acousticness': 0.446,
    'analysis_url': 'https://api.spotify.com/v1/audio-analysis/6y0igZArWVi6Iz0rj35c1Y',
    'danceability': 0.54,
    'duration_ms': 234910,
    'energy': 0.59,
    'id': '6y0igZArWVi6Iz0rj35c1Y',
    'instrumentalness': 0,
    'key': 0,
    'liveness': 0.14,
    'loudness': -4.359,
    'mode': 1,
    'speechiness': 0.0528,
    'tempo': 119.878,
    'time_signature': 4,
    'track_href': 'https://api.spotify.com/v1/tracks/6y0igZArWVi6Iz0rj35c1Y',
    'type': 'audio_features',
    'uri': 'spotify:track:6y0igZArWVi6Iz0rj35c1Y',
    'valence': 0.267
}

Amazing! Even cooler: Spotify gives a breakdown of the meaning of each of these values.

Trimming the data

This method gathered everything in the Spotify Catalog, and I had to make a choice regarding which albums to keep in the dataset. I decided to use:

The Dick's Picks series
The Dave's Picks series
The Download series
And a few arbitrary picks:
- Every available show from the 1972 tour in Europe
- The fan-favourite Cornell 5/8/77 (Probably the best show of their career)
- My favourite show that is in the Spotify Catalog: Winterland Ballroom 12/31/78
- Live at Fillmore East, February 11, 1969
- Live at Shrine Auditorium, August 24, 1968 (Note: in the Spotify catalog, this show is misspelt as Shrine Auritorium)

Cleaning the data

Then came the ~~fun~~ interesting part: cleaning the dataset I had just gathered. The biggest task was to clean the song names so that they'd match one another. Spotify isn't rigorous in its song naming, and I ended up with hundreds of different names referring to the same songs, such as:

Song name
Me And My Uncle
Me & My Uncle
Me & My Uncle - Live
Me and My Uncle - Live at Fairgrounds Arena, Oklahoma City, OK, October 19, 1973
Me And My Uncle - Live at the Fox Theatre, St. Louis, MO 12/10/71
Me & My Uncle (Live at The Fillmore East, New York, NY, April 29, 1971) - 2021 Remaster
etc.

Notice how the song name changes from show to show. Even the "and" is sometimes changed to an ampersand.

There are various ways to tackle this issue. Obviously, regular expressions could help, but I found it even easier to use approximate string matching, especially the n-gram fingerprint method available in OpenRefine (formerly Google Refine, and now open source!). They describe their method as:

The n-gram fingerprint method does the following:

change all characters to their lowercase representation

remove all punctuation, whitespace, and control characters

obtain all the string n-grams

sort the n-grams and remove duplicates

join the sorted n-grams back together

normalize extended western characters to their ASCII representation

This worked surprisingly well, and allowed me to cluster every variant of a song name into a single one.
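
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration rather than OpenRefine's implementation or the exact code I ran. The clean_title pre-cleaning step (dropping the "- Live at ..." suffixes and spelling out the ampersand) is an assumption added for the example, and the ASCII normalisation is done first instead of last so that it cannot affect the sort order.

```python
import re
import unicodedata
from collections import defaultdict


def clean_title(title):
    """Hypothetical pre-cleaning: drop ' - Live at ...' or '(Live ...)' suffixes, spell out '&'."""
    base = re.split(r'\s+-\s+|\s*\(', title, maxsplit=1)[0]
    return base.replace('&', 'and')


def ngram_fingerprint(text, n=2):
    """Key made of the sorted, de-duplicated character n-grams of the text."""
    # Lowercase, then normalise extended western characters to ASCII
    text = unicodedata.normalize('NFKD', text.lower())
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation, whitespace and control characters
    text = ''.join(ch for ch in text if ch.isalnum())
    # Obtain the n-grams; the set removes duplicates
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    # Sort the n-grams and join them back together
    return ''.join(sorted(grams))


def cluster_titles(titles, n=2):
    """Group together the titles that share the same fingerprint."""
    clusters = defaultdict(list)
    for title in titles:
        clusters[ngram_fingerprint(clean_title(title), n)].append(title)
    return dict(clusters)
```

Run on the "Me And My Uncle" variants listed earlier, cluster_titles puts all of them under a single key. Titles that begin with a parenthesis would need a smarter clean_title than this one.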

Adding a date column to our dataset

Unfortunately, our dataset did not contain the dates on which the shows were recorded. I noticed that, for most of the shows, the date was contained either in the album name or in the track names themselves. I came across a Python module called datefinder, which describes itself as:

A Python module for locating dates inside text. Use this package to extract all sorts of date like strings from a document and turn them into datetime objects.

This module finds the likely datetime strings and then uses dateutil to convert to the datetime object.

Then, adding a date column was easy with:

    import datefinder

    def finddate(x):
        # Return the first date found in the text, or None when there is none,
        # so that the column only ever holds datetime objects or missing values
        matches = list(datefinder.find_dates(x))
        if len(matches) > 0:
            return matches[0]
        return None

    df['date'] = df['track_name'].apply(finddate)
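
Since the date sometimes sits in the album name rather than in the track name, a fallback can help. Below is a dependency-free sketch that only understands the numeric M/D/YY and M/D/YYYY patterns seen in the names above. The find_show_date name is hypothetical, and mapping two-digit years to 19xx is an assumption that holds here because every show took place in the 20th century.

```python
import re
from datetime import date

# Numeric US-style dates such as 12/10/71 or 4/21/1972
DATE_RE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b')


def find_show_date(*texts):
    """Return the first valid M/D/Y date found in the given strings, else None."""
    for text in texts:
        for match in DATE_RE.finditer(text):
            month, day, year = (int(g) for g in match.groups())
            if year < 100:
                year += 1900  # assumption: all shows are from the 20th century
            try:
                return date(year, month, day)
            except ValueError:
                continue  # skip impossible dates such as 13/45/71
    return None
```

Calling find_show_date(track_name, album_name) for each row looks at the track name first and only falls back to the album name when no date is found there. Dates written out in words ("October 19, 1973") are still a job for datefinder.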

Exploratory analysis

Let's begin with a simple visualisation to see what we're really working with. To achieve this, let's create a scatterplot of every show in our dataset, with the date on the x-axis and, for this example, the mean energy on the y-axis:

Following this method, we can instantly come to a few interesting conclusions that we'll have to keep in mind going forward into our analysis:

Most of the shows in our dataset took place before 1980. This is not surprising: although the band kept going until the 90s, most fans agree that the best shows date from their earlier career. Interestingly, this seems to correlate with the death of their keyboardist Keith Godchaux in 1980.
The small cluster at the beginning of the dataset is the band's famous 1972 tour in Europe, which is a fan favourite.

However, we're immediately met with an issue in our method: by calculating the plain mean of a feature over a show, we give every track the same weight, regardless of its length. For the sake of this exercise, let's take the duration of each track into account and use it to weight the values within each album. The math behind this is quite simple:

$$ \bar{x}_{w} = {\sum_{i = 1}^{n} w_{i} x_{i} \over \sum_{i = 1}^{n} w_{i}} $$

Here, x is the value of the feature for each of the n tracks of an album, and w is the duration of that track. In pandas terms, this translates to:


    # df_weighted is assumed to hold one row per album, sorted by album_name
    df['energy_weighted'] = df.energy * df.duration_ms
    sums = df.groupby('album_name')[['energy_weighted', 'duration_ms']].sum()
    df_weighted['energy_weighted'] = (sums.energy_weighted / sums.duration_ms).to_list()
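
To make the effect of the weighting concrete, here is a worked example with made-up numbers (they are not taken from the dataset): a 3-minute song with an energy of 0.8, followed by a 20-minute jam with an energy of 0.4. The plain mean treats both tracks equally:

$$ \bar{x} = {0.8 + 0.4 \over 2} = 0.6 $$

whereas the duration-weighted mean is pulled towards the long jam:

$$ \bar{x}_{w} = {3 \times 0.8 + 20 \times 0.4 \over 3 + 20} = {10.4 \over 23} \approx 0.45 $$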

Let's see how different our plot looks with this method:

Now, let's see how it looks with the valence column. Spotify describes this value as:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Basically, this is the "happiness" of a track. With each track weighted by its duration, the graph below shows us the most "happy, cheerful, euphoric" shows.

Interesting! Look at that point up there in May 1977: this is the fan-favourite 5/8/77 show, live at Cornell University.

For those unfamiliar with the Dead, 5/8/77 live at Cornell is widely considered the band's greatest performance. The recording was even selected for inclusion in the National Recording Registry of the Library of Congress. Plenty of articles have been written about why this show is the best in the Dead's entire catalog, citing its insane energy and a recording quality that is pristine for 1977. And it looks like the data singles this show out too, at least when it comes to valence!

A note on keys and modes

The dataset we gathered also includes the key in which each song was played. Interesting! Let's visualise the number of instances of songs played in each key:

Note: The Grateful Dead were famous for playing their songs in the Mixolydian mode, which is a major scale whose 7th note has been lowered by a half step (semitone). Following this logic, a song in A Mixolydian could be interpreted by an algorithm as a song in B natural minor, since both use the same notes, depending on whether it can pinpoint the root note (which I assume it does). Having looked through the dataset a bit, I did find some errors in the way Spotify guesses the key of a song, though they do not seem to be related to a misreading of the mode. I really wonder how and why it sometimes gets them wrong. Perhaps the background noise of the audience in most shows messes up the algorithm? I honestly don't know! Anyway, these mistakes are rare, and don't really matter in the grand scheme of things.

As it happens, most songs by the Dead are in the keys of A major and G major, which is not surprising considering those are the most popular keys in rock music overall, but this graph doesn't tell us much more than that. I guess it'd be interesting to compare this to various other rock bands of this era, but let's move on. I have also tried to see whether there was any trend in the keys used over the years, to no avail. Since I'll be focusing on showing interesting graphs and discoveries, trust me on this one.
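
For reference, Spotify encodes the key as an integer in pitch class notation (0 is C, 1 is C♯/D♭, and so on, with -1 meaning that no key was detected) and the mode as 1 for major and 0 for minor. A small sketch of how those numbers could be turned into readable labels and counted; the key_name and count_keys helpers are hypothetical names of mine.

```python
from collections import Counter

# Pitch class notation: index 0 is C, index 11 is B
PITCH_CLASSES = ['C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F',
                 'F#/Gb', 'G', 'G#/Ab', 'A', 'A#/Bb', 'B']


def key_name(key, mode):
    """Turn Spotify's numeric key and mode into a label such as 'A major'."""
    if key not in range(12):
        return 'unknown'  # Spotify uses -1 when no key was detected
    return '{} {}'.format(PITCH_CLASSES[key], 'major' if mode == 1 else 'minor')


def count_keys(tracks):
    """Count how many tracks fall in each key, given dicts with 'key' and 'mode'."""
    return Counter(key_name(t['key'], t['mode']) for t in tracks)
```

For instance, the sample response shown earlier (key 0, mode 1) is labelled "C major", and count_keys(data).most_common() gives the ranking behind a bar chart like the one discussed here.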

Mapping the variance of songs over their career

Most songs by the Grateful Dead are not set in stone when it comes to the way they're played, the key they're in, their mood, their length, and sometimes even their lyrics. My favourite song, "Jack Straw", went from slow and steady in the early 70s to flat-out raucous and jamming in their later years with Brent Mydland on keyboards. The examples are numerous. Some songs did not change much over the course of the band's career, keeping the same length, tempo and lyrics for years. Others are known to have gone through many variations over the years. Let's try and use data to figure out which individual songs changed the most over their career. For the sake of visualisation, let's focus on the 20 most played songs:

Let's start by visualising the variance in tempo through the years:

It turns out that the majority of their popular songs didn't undergo a major change in tempo over the band's career. However, a few of them stick out:

As having an increase in tempo throughout the years:
- Good Lovin'
- Beat It On Down the Line
- Not Fade Away
As having a decrease in tempo throughout the years:
- He's Gone
- The Other One

Following these observations, we can use the same logic to map out how the variance of each song compares to that of the others, by removing the dates from the x-axis and using a box plot, which also has the benefit of not letting the outliers drive the comparison:

On the topic of song length

The Grateful Dead are known for their extensive catalog of songs and for the variety of genres they play, some songs being short folk ballads and others long, psychedelic jams. Moreover, since the Grateful Dead are a jam band, the listener usually knows when a song begins, but never when it will end. It's not unusual for the listener to not even notice that a song has ended, as the band often made seamless transitions from song to song. The Spotify catalog, however, is usually quite good at pinpointing the right moment for song transitions. Again, we can use a box plot to visualise the duration of each instance of a song:

Interesting! There are quite a few takeaways from this visualisation:

First of all, Bob Weir's cowboy ballads (Mama Tried, El Paso) are all the way down there and are rarely longer than two and a half minutes. The fan-favourite Dark Star is up there at the top with a mean duration of around 10 minutes. However, I am pretty certain that our dataset of concerts in the Spotify catalog largely underestimates the lengths that Dark Stars could reach. In fact, there are plenty of longer, stranger Dark Stars out there on archive.org, one of them even lasting over 43 minutes, about the length of Beethoven's Sixth Symphony. Some songs are surprisingly consistent in length, because both their tempo and their jam potential rarely vary.

Good Lovin', Playing in the Band, The Other One, Not Fade Away... are all songs that were never set in stone over the span of the band's career. They're obviously up there as the songs with the most variance in their duration, sometimes with a 15-minute difference between the shortest and longest instances.
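
The height of each box in a box plot is the interquartile range (IQR), so the same comparison can be made numerically. Here is a minimal pandas sketch, assuming the cleaned song names live in a column that I call song (a hypothetical name); the song_spread helper is an illustration, not the code behind the plots.

```python
import pandas as pd


def song_spread(df, column, top_n=20, song_col='song'):
    """Spread of `column` (e.g. tempo or duration_ms) for the most played songs."""
    # Keep only the top_n songs with the most instances in the dataset
    top = df[song_col].value_counts().head(top_n).index
    grouped = df[df[song_col].isin(top)].groupby(song_col)[column]
    summary = grouped.agg(['count', 'median', 'min', 'max'])
    # The IQR is the height of the box in a box plot, so it ignores outliers
    summary['iqr'] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return summary.sort_values('iqr', ascending=False)
```

Something like song_spread(df, 'duration_ms') would then list the songs with the most variable lengths first, and song_spread(df, 'tempo') would do the same for tempo.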

Finding out the jammiest show

A value we haven't looked at yet in our dataset is the instrumentalness column, which Spotify explains as:

Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

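A show whose duration-weighted instrumentalness is high spends most of its running time on instrumental passages, which is a decent proxy for how much the band jammed that night. In data terms, this means going back to the weighted values we calculated earlier, and applying the same method to the instrumentalness column. No need for a fancy graph for this one. In pandas terms, we can do it like this: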

  df_weighted.sort_values(by='instrumentalness_weighted', ascending=False).head()

Which provides us with:

album_name	instrumentalness_weighted
Dick's Picks Vol. 2: Ohio Theater, Columbus, OH 10/31/71 (Live)	0.609904
Europe '72 Vol. 7: Live at Beat Club, Bremen, West Germany 4/21/1972	0.567638
Dick's Picks Vol. 7: Alexandra Palace, London, England 9/9/74 - 9/11/74 (Live)	0.567625
Dick's Picks Vol. 1: Curtis Hixon Hall, Tampa, FL 12/19/73 (Live)	0.566187
Dick's Picks Vol. 32: Alpine Valley Music Theater, East Troy, WI 8/7/82 (Live)	0.523358

It looks like the second Dick's Picks contains the jammiest show of the whole Spotify catalog. A look at the setlist for this date quickly shows us why:

Dark Star
Sugar Magnolia
St. Stephen
Not Fade Away (1) >
Going Down the Road Feeling Bad >
Not Fade Away (2)

Most of these songs are the pinnacle of the Grateful Dead's jam repertoire. With a Dark Star and segues all around, I'm not surprised that the data picks this specific show as the jammiest of all.
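
The instrumentalness_weighted column used for this ranking is built the same way as the weighted energy. A sketch of a small helper that weights several features at once; weighted_by_duration is a hypothetical name and not necessarily how I wrote it at the time.

```python
import pandas as pd


def weighted_by_duration(df, features, group='album_name', weight='duration_ms'):
    """Duration-weighted mean of each feature, with one row per album."""
    # Multiply every feature by the track duration
    weighted = df[features].mul(df[weight], axis=0)
    weighted[group] = df[group]
    # Sum the weighted values and the durations per album, then divide
    totals = weighted.groupby(group)[features].sum()
    durations = df.groupby(group)[weight].sum()
    result = totals.div(durations, axis=0)
    return result.add_suffix('_weighted').reset_index()
```

Calling weighted_by_duration(df, ['energy', 'valence', 'instrumentalness']) returns a table with album_name and one _weighted column per feature, ready to be sorted as above.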

Conclusion

Admittedly, the dataset I've gathered only includes the Grateful Dead shows that happen to be in the Spotify catalog (only 1663 individual tracks and only 89 shows), which pales in comparison to the sheer number of songs the Dead played in their career, probably ten times more than this. (My favourite show, for example, isn't even in the Spotify catalog.) Most concerts the Dead played are available on archive.org, but I have not found a way to run an audio analysis on them. It's also technically possible to go wild and investigate the structure of individual tracks with the audio-analysis endpoint, which gives things like timbre and pitches over the course of a song. For example, why use Spotify metadata like “danceability” when you could just cluster directly on the second-by-second timbre and rhythms of each song?

Another fun idea could've been to do more deliberate clustering based on the various "eras" of the band: the Pigpen era, the Donna and Keith era, the Brent era, the 90s, etc.

I hope you found this interesting, and, in the words of the band themselves: "We bid you goodnight".