Taking a stab at data mining Dylan

UPDATE (June 3, 2018): About a year after initially attempting this project, I decided to take another stab at data mining Dylan. With more programming experience, especially in the world of “data science”, I wanted to try to do things in a cleaner and more sophisticated way, and produce a more interesting end product. You can view the result at data-mining-dylan.dustinmichels.com.

My goal was the same: count references to cities throughout Bob Dylan’s lyrics and make an interactive bubble map of the results. However, I made a few interesting changes. The second time around:

  1. Scraping: I did the web scraping with Scrapy instead of Beautiful Soup.
  2. Data formats: I saved the web-scraped data in a structured way (JSON) instead of plain .txt files.
  3. Data processing: I did the data processing using Pandas within Jupyter Notebooks, rather than using pure Python. So much nicer!! (See code here.)
  4. Identifying cities in lyrics: I identified cities by using a simple regex to search for one or more capitalized words and then cross-referencing those words against a csv file listing world cities. This was much faster, simpler, and more effective than my original approach of using the nltk package to do named entity recognition, and then cross referencing that against my list of cities.
  5. Making an interactive map: Finally, for the end product, I created a custom mapping widget using JavaScript, leaflet.js, and vue.js. Previously I just uploaded a csv of mapping data to CARTO. My tool is much better custom-tailored to this project: it lets you click on a city on the map and easily see exactly which lyrics mention that city.
Summary of different techniques for data mining Dylan project, the first time vs. the second time.
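For a rough sense of the capitalized-words approach from item 4, here is a sketch. The regex, the tiny inline CSV, and the column names are illustrative stand-ins, not the actual code or the actual SimpleMaps file:

```python
import csv
import io
import re

# Hypothetical stand-in for the world-cities file
# (the real file has more columns and thousands of rows).
WORLD_CITIES_CSV = """city,lat,lng
New Orleans,29.9546,-90.0751
San Francisco,37.7562,-122.4430
Mobile,30.6684,-88.1002
"""

# One or more consecutive capitalized words, e.g. "New Orleans"
CAPITALIZED = re.compile(r"(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)*")

def find_cities(lyrics, cities_file):
    """Cross-reference capitalized phrases in the lyrics against known cities."""
    known = {row["city"] for row in csv.DictReader(cities_file)}
    candidates = set(CAPITALIZED.findall(lyrics))
    return candidates & known

lyrics = "I'm going back to New Orleans to wear that ball and chain"
print(find_cities(lyrics, io.StringIO(WORLD_CITIES_CSV)))
# {'New Orleans'}
```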

I got to present my project to digital humanities scholars at Carleton College’s “Day of Digital Humanities 2018,” which was a gratifying conclusion to this independent project. (See slides here). The current version of my map is live at: data-mining-dylan.dustinmichels.com.

We know that the freewheelin’ Bob Dylan rambled and roamed all across the United States. He grew up bored and cold in the mining town of Hibbing, Minnesota. When he learned that his musical idol, Woody Guthrie, was on his deathbed, he made a pilgrimage to NYC in hopes of seeing Guthrie in the hospital. Once he was in New York, Dylan hung around Greenwich Village for a while, soaking up new musical and lyrical styles from that 1960s creative hub. He recorded an album, got himself famous, and went on to travel all over the US and the world.

We know he went lots of places. But which places did he sing about? To answer that question, I made a tentative foray into text mining with Python and its web-scraping and natural-language-processing modules, then mapped the results with Carto.com. Here’s the result, so far:

Step I. Retrieve Songs and Lyrics

The first step was to scrape lyrics from BobDylan.com, using Requests and Beautiful Soup. The Dylan page is well-organized, which made this not terribly difficult. First we go to bobdylan.com/songs, and get the links to the page of each song on the website.

Then we visit each song’s page one at a time, and extract the title, author, lyrics, and year, based on the HTML tags this content is contained in. This data gets stored as a custom Song class.

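A sketch of that scraping step. The HTML here is an inline stand-in, and the tag and class names are hypothetical (the real markup on bobdylan.com differs), but the shape of the approach is the same:

```python
from dataclasses import dataclass

from bs4 import BeautifulSoup  # pip install beautifulsoup4

@dataclass
class Song:
    title: str
    author: str
    lyrics: str
    year: str

# Stand-in for a fetched song page; the real selectors would be
# whatever tags/classes bobdylan.com actually uses.
SAMPLE_PAGE = """
<html><body>
  <h1 class="song-title">Blowin' in the Wind</h1>
  <p class="credit">Written by: Bob Dylan</p>
  <span class="release-year">1963</span>
  <div class="lyrics">How many roads must a man walk down...</div>
</body></html>
"""

def parse_song(html):
    """Pull title, author, lyrics, and year out of a song page."""
    soup = BeautifulSoup(html, "html.parser")
    return Song(
        title=soup.find("h1", class_="song-title").get_text(strip=True),
        author=soup.find("p", class_="credit").get_text(strip=True),
        lyrics=soup.find("div", class_="lyrics").get_text(strip=True),
        year=soup.find("span", class_="release-year").get_text(strip=True),
    )

song = parse_song(SAMPLE_PAGE)
print(song.title, song.year)
```

In the real script, Requests fetches each song URL first and the response body is handed to the parser.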

So we don’t have to keep scraping the web page, the scraped data gets written to a pickle file, and also written to a series of text files. It’s pretty exciting to watch it work!

Step II. Find Potential Places in Lyrics

The next step was to analyze the lyrics of each song and figure out if there are any places mentioned. I adapted some code I found online that uses nltk to do “named-entity recognition” – extracting named entities (like people and places) from a block of text. It basically uses nltk to apply a part-of-speech tag to each word in the lyrics, and it returns a set of unique named entities. This takes a little while: on average, 0.1225 seconds per song.


It does a decent job of finding people and places. For the song “John Wesley Harding” it spits out “John Wesley Harding” and “Chaynee County.” It also picks up some junk. I suspect it’s difficult to sentence diagram Dylan lyrics.

Step III. Determine which potential places are actually places

The next step was to cross reference that short list of potential places with a list of real places to see if there were any matches. I found a CSV file on SimpleMaps.com that contained a list of world cities and their latitude and longitude. My cross referencing code looks like this: open the csv, see if the name in the csv is in the set of potential names, and if so create a tuple (name, lat, lon) and add it to a list.

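A minimal sketch of that cross-referencing loop, with a tiny inline stand-in for the SimpleMaps file (the real CSV is of course much larger):

```python
import csv
import io

# Hypothetical stand-in for the SimpleMaps world-cities file.
WORLD_CITIES_CSV = """city,lat,lng
Durango,24.0286,-104.6532
Lincoln,40.8088,-96.6796
"""

def match_places(potential_places, cities_file):
    """Keep only candidate names that appear in the cities CSV,
    as (name, lat, lon) tuples."""
    matches = []
    for row in csv.DictReader(cities_file):
        if row["city"] in potential_places:
            matches.append((row["city"], float(row["lat"]), float(row["lng"])))
    return matches

potential = {"Durango", "Magdalena", "Ramona"}  # candidates from the NER step
print(match_places(potential, io.StringIO(WORLD_CITIES_CSV)))
# [('Durango', 24.0286, -104.6532)]
```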

This step is faster than I expected. (On average, 0.0648 seconds per song.) From our list of potential places, it spits back a list of tuples containing actual places and their coordinates.

Step IV. Count occurrences of cities

The next step is to count how many times the various places are mentioned. I do this with a dictionary called city_count, where the keys are city tuples (name, lat, lon) and the values are themselves dictionaries, with the keys “count” (the number of times the place is mentioned), “songs” (which songs mention the place), and “context” (little chunks of text surrounding the mention).
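That counting logic can be sketched like this; the input shapes (song tuples, place tuples) are my assumptions for illustration:

```python
def count_mentions(songs):
    """songs: list of (song_title, lyrics, matched_places), where
    matched_places is a list of (name, lat, lon) tuples."""
    city_count = {}
    for title, lyrics, places in songs:
        for place in places:
            entry = city_count.setdefault(
                place, {"count": 0, "songs": [], "context": []})
            entry["count"] += lyrics.count(place[0])
            entry["songs"].append(title)
            # Grab a small window of text around the first mention
            i = lyrics.find(place[0])
            entry["context"].append(lyrics[max(0, i - 20): i + len(place[0]) + 20])
    return city_count

songs = [("California", "San Francisco's fine / You sure get lots of sun",
          [("San Francisco", 37.7562, -122.4430)])]
counts = count_mentions(songs)
print(counts[("San Francisco", 37.7562, -122.4430)]["count"])  # 1
```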

Step V. Write to CSV file

Finally, I take all that information and write it to a CSV file. I used a small function to organize the data slightly differently, then used Python’s csv.DictWriter to accomplish the task.
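A sketch of that writing step with csv.DictWriter; the field names and the joined-list formatting are my assumptions about the output layout:

```python
import csv
import io

# city_count as built in the counting step (shape assumed from the post)
city_count = {
    ("San Francisco", 37.7562, -122.443): {
        "count": 2, "songs": ["California"],
        "context": ["San Francisco's fine"]},
}

def write_csv(city_count, out):
    """Flatten city_count into one CSV row per city."""
    fields = ["name", "lat", "lon", "count", "songs", "context"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for (name, lat, lon), info in city_count.items():
        writer.writerow({
            "name": name, "lat": lat, "lon": lon,
            "count": info["count"],
            "songs": "; ".join(info["songs"]),
            "context": "; ".join(info["context"]),
        })

buffer = io.StringIO()
write_csv(city_count, buffer)
print(buffer.getvalue())
```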

The result is a CSV file with one row per city: its name, coordinates, mention count, the songs that mention it, and the surrounding context.

Step VI. Map it!

Finally, I uploaded the CSV to Carto.com and produced a slick bubble map! Carto makes it very easy to produce a variety of different kinds of maps from raw data, like CSV files. It will automatically figure out which of your data fields indicate location, and lets you customize the base map, the look of your data points, and which fields to display upon hovering or clicking. I have used Carto in the past to map where I slept as I traveled through Europe, and where I pooped around town.

Some notes:

There are certainly some issues with this mapping. The most obvious, perhaps, is that we’ve picked up places that we shouldn’t have. When Dylan sings “Man gave names to all the animals” he isn’t talking about the town of Man in Ivory Coast.

dylan map

The fact that you can click on a point and see the context in which it was mentioned mitigates the damage of this kind of mistake – it’s easy to find out what he was really talking about. But the error still detracts from the overall meaningfulness of the map, especially when viewed from a distance, as a whole.

We also have duplicates. Every mention of “San Francisco,” for instance, is mapped to both San Francisco, California and San Francisco, Argentina. This is a little bit of a problem. I can imagine writing code that guesses which of two places Dylan is actually referring to, perhaps by weighing population size and location. (I’m thinking that a big city in the US is more likely to be the one he is referring to than a small city in Eastern Europe.) But I haven’t implemented anything like this yet.
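Such a heuristic could be as simple as preferring a US match and breaking ties by population. This is purely illustrative, with made-up candidate records:

```python
def disambiguate(name, candidates):
    """candidates: list of dicts with 'city', 'country', 'population'.
    Naive guess: prefer the US match, then the biggest population."""
    matches = [c for c in candidates if c["city"] == name]
    if not matches:
        return None
    return max(matches, key=lambda c: (c["country"] == "United States",
                                       c["population"]))

# Placeholder population figures, for illustration only
candidates = [
    {"city": "San Francisco", "country": "United States", "population": 3_600_000},
    {"city": "San Francisco", "country": "Argentina", "population": 59_000},
]
print(disambiguate("San Francisco", candidates)["country"])  # United States
```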

What is perhaps a slightly less obvious issue is what we miss. In the song “Brownsville Girl” Dylan sings, “Well, we drove that car all night into San Anton‘ / And we slept near the Alamo, your skin was so tender and soft.” And in “California,” he sings “San Francisco’s fine / You sure get lots of sun.”

Unfortunately, the program doesn’t quite catch on to the fact that Dylan is using an abbreviation of “San Antonio” in the first song, and contracting the words “San Francisco is” in the second, and both of these place mentions get erroneously mapped to San, Burkina Faso. We can imagine baking more flexibility into the name recognition to solve these issues.

We also miss – by design rather than error – most non-city place references (like “Hollywood”) as well as state, region, and country references, which abound in Dylan’s lyrics. In the near future, I’d like to try creating a choropleth map of state or country references.

So, the project is not complete, but it was an exciting foray into data mining Dylan. My code is still a little sloppy, but it is up on Github if you want to play with it or make edits. Enjoy!


CC BY-NC 4.0 Taking a stab at data mining Dylan by Dustin is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.