With the explosion of audio content online, organizing and categorizing music has never been more important. As millions of new tracks are uploaded daily, music streaming services need to be smart about how they recommend songs to listeners. The key to this personalization? Accurately classifying music by genre.
Imagine opening your favorite music app and instantly being presented with songs that perfectly match your mood. That’s the magic of genre classification at work—streaming services are constantly improving how they tag and recommend music, making your listening experience more enjoyable and tailored to your taste.
Spectrograms: Visualizing Audio
So, how do these services figure out which songs belong to which genres? One powerful technique involves transforming audio files into something called spectrograms. These are visual representations of sound that look like images, making it possible to treat the entire genre classification task as an image classification problem.
Spectrograms represent songs as two-dimensional images with:
- Time plotted on the x-axis,
- Frequency on the y-axis, and
- Amplitude or intensity represented through a color spectrum.
Each spectrogram gives us a unique fingerprint of a song, allowing algorithms to “see” and categorize music much like how we recognize objects in photos.
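For illustration, here is a minimal sketch of that conversion using librosa (introduced in the next section). The file name, duration, and figure settings are placeholders rather than the exact pipeline used here:

```python
# A minimal sketch: turn one audio file into a mel spectrogram image.
# "song.wav" and the 30-second clip length are illustrative assumptions.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("song.wav", duration=30)    # load 30 s of audio
S = librosa.feature.melspectrogram(y=y, sr=sr)   # mel-scaled spectrogram
S_db = librosa.power_to_db(S, ref=np.max)        # convert power to decibels

plt.figure(figsize=(4, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.axis("off")
plt.savefig("song_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close()
```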
Data Scraping and Spectrogram Conversion
Every predictive model needs a sizeable dataset to be trained to an acceptable level of performance. Fortunately, music is widely available on the web, and a few clever Python packages help streamline the collection process.
These are some Python packages that are very helpful for music and audio analysis:
- youtube-search: Scrapes YouTube search results (via URL requests and regular-expression matching), letting developers collect video links for a given query.
- pafy: A library that downloads media, including audio streams, from YouTube URLs.
- librosa: A powerful library that converts audio files into spectrograms, making it possible to analyze and classify music.
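Put together, a rough collection script for one genre might look like the sketch below. The search query, result count, and output paths are illustrative assumptions, and both packages depend on YouTube's page structure, so the exact calls may need adjusting:

```python
# A rough sketch of chaining these packages to collect audio for one genre.
from youtube_search import YoutubeSearch
import pafy

results = YoutubeSearch("classical piano full album", max_results=5).to_dict()

for video in results:
    url = "https://www.youtube.com" + video["url_suffix"]   # rebuild the full video URL
    stream = pafy.new(url).getbestaudio()                   # pick the best audio-only stream
    stream.download(filepath=f"classical/{video['id']}.{stream.extension}")
```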
Turning Sounds Into Structured Information
Before we can teach a machine to recognize genres, the audio data must be preprocessed to ensure accuracy and consistency:
- Resizing: Spectrogram images are resized to 64x64 pixels. This standardizes the input, making it easier for algorithms to process.
- Normalization: The data is normalized with a mean of 0.5 and a standard deviation of 0.5, ensuring the model performs consistently across different audio samples.
- Balancing: The dataset is trimmed to maintain equal representation of each genre, minimizing bias and improving the fairness of the model.
These steps ensure that the classification model can learn effectively and produce reliable results, giving listeners a seamless and personalized music discovery experience.
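In PyTorch/torchvision terms, a preprocessing pipeline matching these steps might look like the sketch below. It assumes the spectrogram images are saved to disk in one folder per genre and loaded as 3-channel RGB:

```python
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),                                      # standardize every spectrogram to 64x64
    transforms.ToTensor(),                                            # convert to a tensor with values in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # mean 0.5, std 0.5 per channel
])

# ImageFolder expects a layout like spectrograms/jazz/*.png, spectrograms/metal/*.png, ...
dataset = datasets.ImageFolder("spectrograms", transform=preprocess)
```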
Before beginning, it’s also important to take a quick look at the data.
Tuning the Knobs and Dials
Creating a reliable model for genre classification isn’t just about feeding spectrograms into an algorithm—it’s about fine-tuning the process to achieve the best possible performance. This is where training techniques and hyperparameter selection come into play, acting as the secret sauce that elevates a model from good to great.
Train, Test, and Validate: Building a Solid Foundation
In machine learning, one of the most crucial steps is dividing your data into three distinct sets: training, validation, and testing. The training set is used to teach the model, while the validation set helps fine-tune it. By tweaking the model based on validation results, you ensure it generalizes well to new, unseen data. Finally, the test set is used to evaluate the model’s performance, giving you a clear picture of how well it will perform in real-world scenarios.
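One common way to carve out the three splits in PyTorch, reusing the `dataset` from the preprocessing sketch above (the 70/15/15 ratios are an assumption, not necessarily the split used here):

```python
from torch.utils.data import DataLoader, random_split

n = len(dataset)                      # `dataset` from the preprocessing sketch above
n_train, n_val = int(0.7 * n), int(0.15 * n)
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n - n_train - n_val])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)   # used to fit the weights
val_loader = DataLoader(val_set, batch_size=32)                     # used to tune the model
test_loader = DataLoader(test_set, batch_size=32)                   # held out for final evaluation
```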
Learning Rate Schedule: Keeping the Model on Track
A well-tuned learning rate can make all the difference. By using a dynamic learning rate schedule, the model can adjust how quickly it learns based on its performance on the validation set. If the model’s validation loss improves, the learning rate might stay the same or even increase slightly to speed up training. If the loss plateaus or worsens, the learning rate can decrease, preventing the model from overshooting the optimal solution.
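A typical way to express this in PyTorch is ReduceLROnPlateau, which lowers the learning rate when the validation loss stops improving. The optimizer, factor, and patience below are plausible choices rather than the exact ones used, and `train_one_epoch`/`evaluate` are hypothetical helper functions:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical training helper
    val_loss = evaluate(model, val_loader)            # hypothetical validation helper
    scheduler.step(val_loss)                          # shrink the learning rate if val loss plateaus
```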
Early Stopping: Knowing When to Quit
Training a model for too many epochs can lead to overfitting, where the model learns the training data too well and performs poorly on new data. On the flip side, stopping too early can result in underfitting, where the model hasn’t learned enough. Early stopping is a technique that monitors the model’s performance and halts training when no significant improvements are being made, ensuring the model is just right.
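A bare-bones version of early stopping is sketched below; the patience value is an assumption, and it reuses the same hypothetical helpers as the scheduler example:

```python
import torch

best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5                                              # how long to wait for an improvement

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")   # keep the best checkpoint so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break
```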
Let’s Try It Out!
As a proof of concept, four common genres of music (Jazz, Classical, Techno, Metal) were collected and converted into spectrograms for training, testing, and evaluation. These four categories form the backbone of all subsequent performance analysis.
Logistic Regression
Kicking things off with a simple model like logistic regression is a smart way to establish a baseline for performance. While it might seem odd to apply logistic regression to flattened pixel data from spectrograms, it’s a useful exercise to gauge how much more complex models might improve on basic performance.
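In PyTorch, logistic regression amounts to a single linear layer over the flattened pixels, trained with cross-entropy. The sketch below illustrates the idea, assuming 3-channel 64x64 inputs, and is not necessarily the exact implementation used:

```python
import torch.nn as nn

class LogisticRegressionClassifier(nn.Module):
    def __init__(self, num_classes=4, in_features=3 * 64 * 64):
        super().__init__()
        self.flatten = nn.Flatten()                     # flatten each 64x64 spectrogram image
        self.linear = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.linear(self.flatten(x))             # raw logits; pair with nn.CrossEntropyLoss
```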
Despite its simplicity, the logistic regression model managed a respectable test loss of 0.4279 and test accuracy of 85.43%. While these numbers set the lower bound, they also highlight the potential gains more sophisticated models could achieve.
Multilayered Perceptron Model
Next up was the Multilayered Perceptron (MLP), a step up from logistic regression. The MLP introduces hidden layers, which allow the model to capture more complex patterns in the data. Starting with a baseline (no hidden layers), the results were consistent with logistic regression, but adding hidden layers revealed some interesting dynamics:
| Hidden Layers | Test Loss | Test Accuracy |
| --- | --- | --- |
| 0 (Baseline) | 0.4279 | 85.43% |
| 1 | 0.4654 | 84.01% |
| 2 | 0.4285 | 83.69% |
Surprisingly, the model’s performance dipped slightly with one and two hidden layers. This indicates that while additional complexity might help in some cases, it can also introduce challenges, especially if the network isn’t properly tuned or if the data isn’t sufficiently complex to warrant the extra layers.
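For reference, a two-hidden-layer MLP along these lines could be written as follows (the layer widths are illustrative assumptions):

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),                    # flattened spectrogram pixels, as before
    nn.Linear(3 * 64 * 64, 512),     # first hidden layer
    nn.ReLU(),
    nn.Linear(512, 128),             # second hidden layer
    nn.ReLU(),
    nn.Linear(128, 4),               # one output per genre
)
```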
Convolutional Neural Networks Model
Given that spectrograms are image-like data, it makes sense to explore Convolutional Neural Networks (CNNs), which are particularly effective at image classification tasks. CNNs excel at extracting spatial patterns from their inputs, an essential capability for uncovering hidden patterns in music.
The base CNN performed impressively with a test loss of 0.2804 and a test accuracy of 91.53%. Techniques like
- Batch Normalization,
- Max Pooling, and
- Dropout
played a significant role in this success, helping the model generalize well to new data.
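A single convolutional block illustrating those three techniques is sketched below; the Base CNN stacks blocks like this before a small fully connected head, though the channel counts and dropout rate here are assumptions:

```python
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # learn local time-frequency features
    nn.BatchNorm2d(32),                           # Batch Normalization stabilizes training
    nn.ReLU(),
    nn.MaxPool2d(2),                              # Max Pooling halves the spatial resolution
    nn.Dropout(0.25),                             # Dropout discourages overfitting
)
```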
| Network | Test Loss | Test Accuracy |
| --- | --- | --- |
| Base CNN | 0.2804 | 91.53% |
| Deep CNN | 1.2801 | 27.07% |
| AlexNet | 0.4929 | 81.22% |
| VGG11 (BN) | 0.7008 | 65.94% |
Interestingly, the deeper CNN struggled with overfitting, showing that more layers and parameters aren’t always better, especially with a limited dataset. Simpler architectures like AlexNet and VGG11 with Batch Normalization performed better but still fell short of the Base CNN’s accuracy.
Problems with CNNs
While CNNs excel at capturing spatial features in spectrograms, they struggle with temporal data — a critical aspect of music analysis. The challenge lies in the fact that music is not just about frequency patterns; it’s also about how these patterns change over time.
Additionally, the relatively small dataset used here made it difficult for more complex CNNs to learn effectively, leading to overfitting.
To address these limitations, two key strategies can be explored:
- Temporal Effects: Incorporating models that can capture the temporal dynamics of music, such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks.
- Transfer Learning: Leveraging pretrained models that have already learned to recognize patterns in vast datasets, thereby jumpstarting the learning process for music genre classification.
RNN-CNN / LSTM-CNN Models
To better capture the temporal features inherent in music, the CNNs were combined with Recurrent Neural Network and Long Short-Term Memory models. This hybrid approach allowed the CNN to extract spatial features from the spectrograms while the RNN or LSTM handled the temporal information, resulting in a more holistic understanding of the audio data.
| Network | Test Loss | Test Accuracy |
| --- | --- | --- |
| CNN + RNN | 0.2166 | 92.72% |
| CNN + LSTM | 0.2608 | 91.31% |
The combination of CNNs with RNNs yielded the best performance, improving both test loss and accuracy.
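The sketch below illustrates the hybrid idea with a CNN + LSTM: a small CNN summarizes each time slice of the spectrogram, and the recurrent layer reads those summaries in order. All layer sizes and the number of slices are assumptions rather than the exact architecture used:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(                        # per-slice spatial feature extractor
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=16 * 32 * 4, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        # x: (batch, 3, 64, 64); split the time (width) axis into 8 slices of width 8
        slices = x.chunk(8, dim=3)
        feats = torch.stack([self.cnn(s) for s in slices], dim=1)   # (batch, 8, features)
        out, _ = self.lstm(feats)                        # read the slices in temporal order
        return self.head(out[:, -1])                     # classify from the final time step
```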
Making Use of Transfer Learning
With the challenge of training deep networks from scratch on a limited dataset, transfer learning is a viable alternative. By starting with a model pretrained on a vast dataset (like ImageNet, with 14 million images), the features and patterns it has already learned can be reused to improve performance on a specific dataset such as the music spectrograms here.
There are two common approaches to transfer learning with a pretrained model: update all of its weights by running backpropagation on the new dataset, or update only a small subset of them (typically the final layers) while keeping the rest frozen.
Frozen Weights
In this setup, the pretrained VGG11 feature layers were kept frozen and only the final classification layers were retrained on the music data. This approach yielded a test loss of 0.4917 and a test accuracy of 80.80% — a decent performance, but not groundbreaking.
Full Fine-tuning
To push the model further, all of the layers were allowed to be retrained on the music data. This significantly improved the results, achieving a test loss of 0.2215 and an impressive test accuracy of 93.01%. Full fine-tuning allowed the model to adapt to the nuances of music genres, resulting in a highly accurate classifier.
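Both setups can be sketched with torchvision's pretrained VGG11 as below. Note that ImageNet-pretrained models expect larger RGB inputs (typically 224x224), so the spectrograms would be resized accordingly, and exactly which layers were retrained in the original experiments is not spelled out here:

```python
import torch.nn as nn
from torchvision import models

# Frozen weights: keep the pretrained feature extractor fixed and retrain only a new head.
frozen = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1)
for param in frozen.features.parameters():
    param.requires_grad = False                    # freeze the convolutional layers
frozen.classifier[6] = nn.Linear(4096, 4)          # new output layer for the 4 genres

# Full fine-tuning: start from the same pretrained weights, but let backpropagation
# update every layer on the spectrogram data.
finetuned = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1)
finetuned.classifier[6] = nn.Linear(4096, 4)
```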
A Quick Glance at All the Results
| Network | Pretrained | Test Loss | Test Accuracy |
| --- | --- | --- | --- |
| Baseline (Log Reg) | No | 0.4279 | 85.43% |
| Basic CNN | No | 0.2804 | 91.53% |
| CNN + RNN | No | 0.2166 | 92.72% |
| CNN + LSTM | No | 0.2608 | 91.31% |
| AlexNet | No | 0.4929 | 81.22% |
| AlexNet | Yes | 0.4012 | 90.80% |
| VGG11 | No | 0.7008 | 65.94% |
| VGG11 (Frozen) | Yes | 0.4917 | 80.80% |
| VGG11 (Fine-tuned) | Yes | 0.2215 | 93.01% |
As music streaming services continue to harness the power of AI and machine learning, the way we discover and enjoy music will only get better. Next time you get a spot-on recommendation, you’ll know there’s some cutting-edge tech—and maybe even a little bit of magic—working behind the scenes.