A couple days ago I graduated from the UVA Masters of Data Science program. During the program, most of my projects (as my classmates continuously pointed out) tended to revolve around applying machine learning algorithms in a music setting.
This post is based on the final project from one of my favourite classes, called “Computer Vision and Language”, taught by Vicente Ordonez. The general idea was to leverage state-of-the-art image recognition algorithms alongside language models - as Vicente put it, the goal was to “build machine learning and deep learning models that can reason about images and text.”
You’ve definitely seen the applications of this work in recent years: searching through your photo galleries by keyword (“show me pictures of beaches with sunsets”) or the hilarious “AI generated song lyrics.” Even though some of the papers assigned in the class had been written only weeks or months before (2014 is considered “old” in computer vision), there do seem to be some norms and best practices emerging.
With the stage set, let’s dive into the problem. My teammate, Seth Green (world-renowned bassist and amateur mixologist), and I had spent a lot of time talking about how album artwork contributes to the overall feel of an album.
Some of my favourite albums simply have a FEEL when you look at the album cover and I was convinced that if Seth and I could both look at the same album cover and agree on its “mood” we could train a computer to do the same. For example, the Arctic Monkeys’ first album Whatever People Say I Am, That’s What I’m Not is labelled with the moods Literate, Lively, Boisterous, Brash, Confident:
We set about training models that - if everything went right - would tell you the mood of an album based only on the input image. As you can imagine, this is a pretty difficult task, so much of our work revolved around tweaking our models and performing post-mortems to determine what we had actually learned.
OK, Computer, How Do You See?
Without getting too deep into the nitty-gritty details, I think it’s going to be helpful to preface the rest of this post with how exactly our algorithm is reasoning with an image - so at least read through this section before skipping around.
To start, machine learning is really just a sexy catch-all name for computer models that can learn on their own without every operation being explicitly programmed by a human. In ML, the general goal is to train an algorithm that can find some trend in a dataset and form some conceptual knowledge about how to solve that problem in the future.
When we build ML models, good practice is to “test” them on their ability to generalize their knowledge and reasoning abilities on new/held-out/unseen information. Maybe our model got 100% correct on the training data, but only 10% on the testing data - this would indicate our model really just memorized patterns in the training data rather than picking up on some more general rule and reasoning. Building ML models is not magic; most of the process involves framing the problem correctly and asking the right questions of the data before fitting a model.
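To make the "memorized versus generalized" point concrete, here is a minimal sketch in plain Python. Everything in it is a made-up toy (random numbers with coin-flip labels, and a 1-nearest-neighbour "model" that has nothing to do with our actual networks), chosen only because it memorizes its training data perfectly while having nothing real to learn.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest-neighbour)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the memorizing model labels correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)
# Hypothetical toy data: the labels are coin flips, so there is no pattern to learn.
xs = [rng.random() for _ in range(400)]
ys = [rng.randint(0, 1) for _ in range(400)]

# Hold out the last 100 points; the model never sees them during "training".
train_x, train_y = xs[:300], ys[:300]
test_x, test_y = xs[300:], ys[300:]

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training score is perfect because each training point is its own nearest neighbour, while the held-out score should hover around a coin flip. That gap is the warning sign described above.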
In the computer vision setting, the models we are trying to train are deep convolutional neural networks. The basic idea of a CNN is - given an input image - we want to learn a set of “filters” that isolate important information.
While it’s not a perfect parallel, you could think of these filters like the ones on Instagram that isolate colors and shapes or increase edge sharpness. We want our network to learn thousands of these filters that work together to identify shapes, colors, and textures.
Given enough consistent images and labels to train on, our network will eventually develop to “understand” deeper concepts like faces, bodies and objects. The network learns to associate patterns in filter activations with human-provided labels like dog or cat.
A quick visualization of the filters these networks learn might help:
This visualization shows a network that learned how to isolate various edges and color spots. The networks are called “convolutional” since these filters sort of scan across an image “looking” for certain features. The deeper we go in the network, the more complex these filters get. There are tons of great resources online to better understand this, including the site this image is from.
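If the "scanning" idea feels abstract, here is a small sketch of a single filter sliding across a tiny image, written in plain Python. The image and the vertical-edge filter are hand-picked toy values (a real network learns its filter values from data), and, like most deep learning libraries, the function does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the filter with the image patch under it and sum the result.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Toy 4x6 grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)
# Each row prints [0, 3, 3, 0]
```

The filter stays quiet over the flat regions and "lights up" only where its window straddles the edge, which is the behaviour the visualization is hinting at.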
It is important to note that by training on millions of images, our computer hasn’t become sentient in the “AI is taking over the world by the end of the week” sense you hear on morning talk shows. I think François Chollet - the guy who wrote the deep learning package we used - explains what is being learned best: “Current supervised perception and reinforcement learning algorithms require lots of data, are terrible at planning, and are only doing straightforward pattern recognition.”
So to conclude this brief intro:
- Machine learning is the process of learning patterns from data and trying to apply that knowledge to new examples.
- A deep convolutional neural network is a model that takes an image and tries to learn how patterns within are associated with the objects in the image or the overall image description.
- They are useful tools since analysts don’t necessarily have to go in and tell the computer what a “dog” should look like - it will eventually learn the pattern of fur, eyes, tail, and legs from the data. As you can imagine the model needs to look at a LOT of pictures with dogs and without to best understand how a dog differs from a cat or a wolf.
As mentioned before, our goal with this paper was to see if we could associate the objects, colors and layouts of an album cover with a specific mood. Some work has been done with album covers and genres, but we were really interested in trying to get a “semantic” or emotional read on an album cover.
Our data were acquired from AllMusic.com by a custom web scraper I wrote in Python. Essentially, we downloaded the details for every single album on the site that had been officially reviewed.
Official reviews include a text blurb, album artwork and most importantly: a list of ordered album moods. There were about 200 moods originally, but after some filtering and data cleaning we whittled it down to 167.
The final “AlbumNet” dataset consisted of 127,467 albums. Since our AlbumNet image set is significantly smaller than the millions of images typically used to train a CNN, we opted to use a bit of “transfer learning” magic to start out with a model that already knew some stuff.
Sebastian Ruder has a great explainer on this area of ML, but the general idea is that you pre-train a network on other (typically bigger and cleaner) data to have a good starting point for your own analysis:
If this seems like cheating - you’re right it kind of is. It allows us to:
- Get much better results with our tiny data set,
- Avoid super long training times, and
- Start with a model that already knows a bit about the world (it understands the difference between chairs, cars, cigarettes and strawberries from the start).
We started out with one of the most famous object recognition / classification models, VGG-16. My classmate Colin Cassady explains why VGG-16 is so magical:
"it was the model that showed the world that small convolutional filters can capture the same spatial information in an image as large filters, when many deep convolutional layers are used. Trying to engineer filter sizes is not something people have done after the VGG paper."
In other words, VGG-16 learns thousands of tiny filters that operate over the original image and its filtered version and outputs a “deep representation” of the image.
With the VGG-16 acting as a “feature extracting” step, Seth and I designed a handful of architectures that simply build on top of the VGG prefix:
The graphic above shows the VGG-16 layers that start each of our 4 models. Some models subsequently incorporate basic metadata like genres or release year. Since models 3 and 4 incorporate semantic embeddings of the labels in training, they use cosine similarity loss functions rather than binary crossentropy loss.
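For readers who have not met these two loss functions, here is a rough sketch of each in plain Python. This is not our training code: the function names are my own, and the `1 - cosine similarity` form is just one common convention (some libraries use the negative cosine similarity instead).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent yes/no labels (one per mood)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_loss(target_vec, predicted_vec):
    """0 when the vectors point the same way, 2 when they point in opposite directions."""
    return 1.0 - cosine_similarity(target_vec, predicted_vec)

# Label-only setup: one probability per mood, scored against 0/1 targets.
good_guess = binary_crossentropy([1, 0, 0], [0.9, 0.2, 0.1])
bad_guess = binary_crossentropy([1, 0, 0], [0.1, 0.8, 0.9])
print(good_guess < bad_guess)  # True

# Embedding setup: the prediction is a vector, scored by its direction only.
print(cosine_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # approximately 0
```

The practical difference is what gets compared: cross-entropy checks each mood as a separate yes/no answer, while the cosine loss only asks whether the predicted vector points in the same direction as the target embedding.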
Seth and I started out by loading up the pre-trained VGG model and transforming it to predict each of the 167 moods instead of the 1,000 categories (dog, cat, truck, spaghetti) that the model was originally trained on. As shown above, I ran an album cover through the model before this transformation and it was clear that the model was picking up on furniture and other objects. These models can take a very long time to train, so we usually use cloud-based GPUs on AWS.
- Our models sucked at picking up on the moods; there is no getting around that.
- After a few tweaks and tests it became clear that, although our models were learning some things, they did not exhibit satisfactory results on held-out data.
- In my opinion, our first issue was that our hypothesis was totally wonky, since the album covers you remember the moods for are memorable for a reason.
- When you listen to the Stones’ Beggars Banquet you can feel that dirty graffitied bathroom in Mick’s lyrics - the connection seems clear and strong. However, the Stones’ next album, “Let it Bleed”, has an equally cocky and braggadocious feel but the cover is basically a birthday cake.
- My thought was that once you scaled up this hypothesis to hundreds of thousands of album covers, all the weird niche music tastes out there might not have any consistency at all.
- It also appeared that our model was getting confused by the meanings of the labels. It tended to guess “energetic” for albums that were labeled “upbeat”. To Seth and me these labels are essentially the same thing, so it’s not quite wrong to guess either.
What we decided to do was extend our model’s understanding of the “label space”. In other words, some labels are semantically closer to others (happy, upbeat) while others are very different (friendly, angry).
There has been a bunch of recent research on how to get the “semantic meaning” of a word in the context of a larger vocabulary. One method proposed by Google (Word2Vec) aims to “embed” words into a multidimensional semantic space such that more similar words are clustered closer together and more different words are farther apart. More than that, these word embeddings encode a ton of relational information about words. The most famous example is:
We acquired an embedding for each of the 167 labels from Stanford’s GloVe project (pretty similar to Word2Vec), which was trained on a massive crawl of the internet to learn how words, slang, acronyms and even misspellings are used in juxtaposition to one another.
So now our task isn’t to look at an album cover and say “this is energetic!” but instead try to place that album in the most appropriate spot of the semantic label space - say, between energetic and happy and upbeat and amiable/good natured.
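Here is a toy sketch of what "placing an album in the label space" could look like. The 3-dimensional vectors are invented stand-ins for the real 300-dimensional GloVe embeddings, and using the average of an album's true mood vectors as the spot to aim for is my simplifying assumption for illustration, not a description of our exact training target.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def nearest_moods(predicted_vec, mood_vectors, k=3):
    """Return the k mood names whose embeddings are most similar to the prediction."""
    ranked = sorted(
        mood_vectors,
        key=lambda mood: cosine_similarity(predicted_vec, mood_vectors[mood]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 3-d embeddings (real GloVe vectors have 300 dimensions).
mood_vectors = {
    "energetic": [0.9, 0.8, 0.1],
    "upbeat": [0.8, 0.9, 0.2],
    "happy": [0.7, 0.9, 0.3],
    "angry": [0.9, -0.7, 0.1],
    "somber": [-0.8, -0.3, 0.6],
}

# An album labelled energetic/upbeat/happy gets a target "between" those moods.
target = centroid([mood_vectors[m] for m in ("energetic", "upbeat", "happy")])
nearest = nearest_moods(target, mood_vectors, k=3)
print(sorted(nearest))  # ['energetic', 'happy', 'upbeat']
```

A prediction that lands near this target is "close" to all three moods at once, so guessing energetic for an upbeat album costs very little, unlike in the label-only setup.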
While these models didn’t out-perform the original label-only approach, I find the mistakes they make to be more “semantically reasonable”. Below, I visualized the 300-dimensional embedding space in 2 dimensions (using t-SNE), with each album cover placed where its prediction fell. You can see the predictions fall approximately in the middle of the blue “true” moods, which is exactly what we were trying to do!
So What Did Our Model Learn?
To compare our models against “human performance” Seth and I each reviewed 50 randomly chosen album covers. We used a metric called “Cohen’s Kappa” which measures “interrater agreement” - how much two people agree in their label choices. On average, our agreement score fell into the “seldom agree” category, but what was notable is that we performed about as well as the models. We also calculated micro-averaged precision/recall AUC and tracked training and validation loss, but I won’t go into detail on that here.
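For the curious, Cohen's Kappa compares the agreement two raters actually reached with the agreement they would reach by chance given how often each of them uses each label: kappa = (observed - expected) / (1 - expected). Below is a minimal sketch that assumes each rater picks exactly one label per album, which is simpler than a real multi-mood setup; the ratings themselves are made up.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one label per item."""
    n = len(rater_a)
    # Fraction of items where the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up ratings for 10 album covers.
rater_a = ["upbeat"] * 5 + ["somber"] * 5
rater_b = ["upbeat"] * 3 + ["somber"] * 2 + ["upbeat"] * 2 + ["somber"] * 3

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.2
```

Here the raters agree on 6 of 10 covers, but 5 of 10 would be expected by chance alone, so the kappa of 0.2 is much less impressive than the raw 60% agreement suggests.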
This was all the work we had time to do before our paper deadline, so the project stopped there. What we turned in was a process to try and guess the emotional content of an album given only its cover and some basic metadata. Even if we extended the models to leverage external semantic understandings of the labels, we were unable to create an algorithm that “got it” every time.
After graduation, I was interested in seeing what parts of the album covers excited the model most. This is one big issue with deep, black box models like CNNs: there is no one commonly accepted approach to performing a post-mortem and figuring out what models learned and where they went wrong. One way of figuring out what the model picked up on is a saliency map. These maps essentially start from the end of the model, where the prediction is made, and say:
- "Hey, which label did I think this album got? I thought it was 'sensual' with 87.5%."
- "Which filters were most responsible for making me think that?" (repeated the whole way back to the beginning of the model)
- "Which pixels were most responsible for these filters getting excited and thinking they spotted something?"
- *Highlight Responsible Pixels*
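The backward walk described above boils down to taking the gradient of the predicted score with respect to each input pixel. Here is a heavily simplified sketch: instead of a deep CNN it uses a hypothetical one-layer "model" (a weighted sum of four pixels passed through a sigmoid), so the gradient can be written out by hand. In a real network the same quantity is obtained by backpropagating through every layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, pixels):
    """Score between 0 and 1 for a single mood, e.g. 'sensual'."""
    return sigmoid(sum(w * x for w, x in zip(weights, pixels)) + bias)

def saliency(weights, bias, pixels):
    """Absolute gradient of the predicted score with respect to each input pixel."""
    score = predict(weights, bias, pixels)
    # Chain rule: d sigmoid(w.x + b) / d x_i = score * (1 - score) * w_i
    return [abs(score * (1.0 - score) * w) for w in weights]

# Hypothetical 4-pixel "image" and made-up model weights.
weights = [0.1, -2.0, 0.5, 0.0]
bias = 0.0
pixels = [0.2, 0.9, 0.4, 0.7]

sal = saliency(weights, bias, pixels)
most_responsible = max(range(len(sal)), key=lambda i: sal[i])
print(most_responsible)  # 1
```

Pixel 1 gets highlighted because nudging it changes the score the most, while pixel 3 has zero weight and therefore zero saliency no matter how bright it is.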
I tossed a couple albums into this setup and the results are below:
You can see how the model tends to make decisions based on face orientation when available. On the Jay-Z album, there was a large amount of attention given to the fabric folds of his jacket (rather than, say, his posture or the cigarette he holds).
Throughout all the papers and studies I read during my master’s degree, I don’t remember a single one that presented “bad results”. I guess it makes sense - researchers want to put their best foot forward and always show success.
But in continually failing to create a glorious monster algorithm that knew album covers inside and out, I believe that Seth and I actually learned a lot more about how to tune neural networks and how to rework hypotheses and processes in a scientific fashion. I guess the conclusion here is that failing at a project can urge you to be more creative and scrappy in your problem solving.
Finally, this experiment suggests that album covers play only a small part in the overall “mood” of a record’s experience. Humans label moods quite subjectively, and that subjectivity, combined with the lack of a consistent relationship between album covers and these labels, makes this task damn near impossible. However, I am still convinced that these covers could be used to spice up a conventional collaborative-filtering music recommender (maybe include how close two albums are in semantic space?). Let me know if you have any thoughts or questions in the comments!