In [ ]:
%matplotlib inline
import seaborn
import numpy, scipy, matplotlib.pyplot as plt, sklearn, sklearn.preprocessing, sklearn.svm, pandas, librosa, librosa.display, urllib.request, IPython.display, os.path
plt.rcParams['figure.figsize'] = (14, 5)

Exercise: Genre Recognition

Goals

  1. Extract features from an audio signal.
  2. Train a genre classifier.
  3. Use the classifier to classify the genre of a song.

Step 1: Retrieve Audio

Download an audio file onto your local machine.

In [ ]:
filename_brahms = 'brahms_hungarian_dance_5.mp3'
url = "http://audio.musicinformationretrieval.com/" + filename_brahms
if not os.path.exists(filename_brahms):
    urllib.request.urlretrieve(url, filename=filename_brahms)

Load 120 seconds of an audio file:

In [ ]:
librosa.load?
In [ ]:
x_brahms, fs_brahms = librosa.load(filename_brahms, duration=120)

Plot the time-domain waveform of the audio signal:

In [ ]:
librosa.display.waveplot?
In [ ]:
# Your code here:
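
One possible solution (a sketch; in newer versions of librosa, waveplot is replaced by waveshow):

In [ ]:
librosa.display.waveplot(x_brahms, sr=fs_brahms)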

Play the audio file:

In [ ]:
IPython.display.Audio?
In [ ]:
# Your code here:
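
One possible solution (a sketch):

In [ ]:
IPython.display.Audio(x_brahms, rate=fs_brahms)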

Step 2: Extract Features

Compute the MFCCs for the audio signal. Experiment with n_mfcc to select a different number of coefficients, e.g. 12.

In [ ]:
librosa.feature.mfcc?
In [ ]:
n_mfcc = 12
mfcc_brahms = librosa.feature.mfcc(y=x_brahms, sr=fs_brahms, n_mfcc=n_mfcc).T

We transpose the result to accommodate scikit-learn, which assumes that each row is one observation and each column is one feature dimension:

In [ ]:
mfcc_brahms.shape

Scale the features to have zero mean and unit variance:

In [ ]:
scaler = sklearn.preprocessing.StandardScaler()
In [ ]:
mfcc_brahms_scaled = scaler.fit_transform(mfcc_brahms)

Verify that the scaling worked:

In [ ]:
mfcc_brahms_scaled.mean(axis=0)
In [ ]:
mfcc_brahms_scaled.std(axis=0)

Step 2b: Repeat steps 1 and 2 for another audio file.

In [ ]:
filename_busta = 'busta_rhymes_hits_for_days.mp3'
url = "http://audio.musicinformationretrieval.com/" + filename_busta
In [ ]:
urllib.request.urlretrieve?
In [ ]:
# Your code here. Download the second audio file in the same manner as the first audio file above.
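
One possible solution (a sketch, following the same pattern as the first file):

In [ ]:
if not os.path.exists(filename_busta):
    urllib.request.urlretrieve(url, filename=filename_busta)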

Load 120 seconds of an audio file:

In [ ]:
librosa.load?
In [ ]:
# Your code here. Load the second audio file in the same manner as the first audio file.
# x_busta, fs_busta = 
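
One possible solution (a sketch):

In [ ]:
x_busta, fs_busta = librosa.load(filename_busta, duration=120)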

Listen to the second audio file.

In [ ]:
IPython.display.Audio?
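
One possible solution (a sketch):

In [ ]:
IPython.display.Audio(x_busta, rate=fs_busta)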

Plot the time-domain waveform and spectrogram of the second audio file. In what ways does the time-domain waveform look different than the first audio file? What differences in musical attributes might this reflect? What additional insights are gained from plotting the spectrogram? Explain.

In [ ]:
plt.plot?
In [ ]:
# See http://musicinformationretrieval.com/stft.html for more details on displaying spectrograms.
librosa.feature.melspectrogram?
In [ ]:
librosa.logamplitude?
In [ ]:
librosa.display.specshow?
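
One possible solution (a sketch; S_busta and S_busta_db are illustrative names, and logamplitude is replaced by power_to_db in newer versions of librosa):

In [ ]:
librosa.display.waveplot(x_busta, sr=fs_busta)
In [ ]:
S_busta = librosa.feature.melspectrogram(y=x_busta, sr=fs_busta)
S_busta_db = librosa.logamplitude(S_busta, ref_power=numpy.max)  # in newer librosa: librosa.power_to_db(S_busta, ref=numpy.max)
librosa.display.specshow(S_busta_db, sr=fs_busta, x_axis='time', y_axis='mel')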

Extract MFCCs from the second audio file. Be sure to transpose the resulting matrix so that each row is one observation, i.e. one set of MFCCs. Also check that the shape of the resulting MFCC matrix matches that of the first audio file.

In [ ]:
librosa.feature.mfcc?
In [ ]:
# Your code here:
# mfcc_busta =
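
One possible solution (a sketch):

In [ ]:
mfcc_busta = librosa.feature.mfcc(y=x_busta, sr=fs_busta, n_mfcc=n_mfcc).T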
In [ ]:
mfcc_busta.shape

Scale the resulting MFCC features to have approximately zero mean and unit variance. Re-use the scaler from above.

In [ ]:
scaler.transform?
In [ ]:
# Your code here:
# mfcc_busta_scaled =
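
One possible solution (a sketch):

In [ ]:
mfcc_busta_scaled = scaler.transform(mfcc_busta)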

Verify that the mean of the MFCCs for the second audio file is approximately equal to zero and the variance is approximately equal to one.

In [ ]:
mfcc_busta_scaled.mean?
In [ ]:
mfcc_busta_scaled.std?
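
For example (a sketch, following the same pattern as the first file):

In [ ]:
mfcc_busta_scaled.mean(axis=0)
In [ ]:
mfcc_busta_scaled.std(axis=0)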

Step 3: Train a Classifier

Concatenate all of the scaled feature vectors into one feature table.

In [ ]:
features = numpy.vstack((mfcc_brahms_scaled, mfcc_busta_scaled))
In [ ]:
features.shape

Construct a vector of ground-truth labels, where 0 refers to the first audio file, and 1 refers to the second audio file.

In [ ]:
labels = numpy.concatenate((numpy.zeros(len(mfcc_brahms_scaled)), numpy.ones(len(mfcc_busta_scaled))))

Create a classifier model object:

In [ ]:
# Support Vector Machine
model = sklearn.svm.SVC()

Train the classifier:

In [ ]:
model.fit?
In [ ]:
# Your code here
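
One possible solution (a sketch):

In [ ]:
model.fit(features, labels)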

Step 4: Run the Classifier

To test the classifier, we will extract an unused 10-second segment from each of the earlier audio files as test excerpts:

In [ ]:
x_brahms_test, fs_brahms = librosa.load(filename_brahms, duration=10, offset=120)
In [ ]:
x_busta_test, fs_busta = librosa.load(filename_busta, duration=10, offset=120)

Listen to both of the test audio excerpts:

In [ ]:
IPython.display.Audio?
In [ ]:
IPython.display.Audio?
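
One possible solution (a sketch):

In [ ]:
IPython.display.Audio(x_brahms_test, rate=fs_brahms)
In [ ]:
IPython.display.Audio(x_busta_test, rate=fs_busta)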

Compute MFCCs from both of the test audio excerpts:

In [ ]:
librosa.feature.mfcc?
In [ ]:
librosa.feature.mfcc?
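
One possible solution (a sketch; mfcc_brahms_test and mfcc_busta_test are illustrative names):

In [ ]:
mfcc_brahms_test = librosa.feature.mfcc(y=x_brahms_test, sr=fs_brahms, n_mfcc=n_mfcc).T
In [ ]:
mfcc_busta_test = librosa.feature.mfcc(y=x_busta_test, sr=fs_busta, n_mfcc=n_mfcc).T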

Scale the MFCCs using the previous scaler:

In [ ]:
scaler.transform?
In [ ]:
scaler.transform?
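
One possible solution (a sketch, re-using the scaler fit on the training features):

In [ ]:
mfcc_brahms_test_scaled = scaler.transform(mfcc_brahms_test)
In [ ]:
mfcc_busta_test_scaled = scaler.transform(mfcc_busta_test)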

Concatenate all test features together:

In [ ]:
numpy.vstack?
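
One possible solution (a sketch):

In [ ]:
test_features = numpy.vstack((mfcc_brahms_test_scaled, mfcc_busta_test_scaled))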

Concatenate all test labels together:

In [ ]:
numpy.concatenate?
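
One possible solution (a sketch):

In [ ]:
test_labels = numpy.concatenate((numpy.zeros(len(mfcc_brahms_test_scaled)), numpy.ones(len(mfcc_busta_test_scaled))))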

Compute the predicted labels:

In [ ]:
model.predict?
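
One possible solution (a sketch; predicted_labels is an illustrative name):

In [ ]:
predicted_labels = model.predict(test_features)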

Finally, compute the accuracy score of the classifier on the test data:

In [ ]:
score = model.score(test_features, test_labels)
In [ ]:
score

Currently, the classifier returns one prediction for every MFCC vector in the test audio signal. Can you modify the procedure above such that the classifier returns a single prediction for a 10-second excerpt?

In [ ]:
# Your code here.
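
One possible approach (a sketch; frame_predictions and excerpt_prediction are illustrative names): predict one label per MFCC frame, then take a majority vote over all frames in the excerpt.

In [ ]:
frame_predictions = model.predict(mfcc_brahms_test_scaled)  # one prediction per MFCC frame
excerpt_prediction = 1 if frame_predictions.mean() > 0.5 else 0  # majority vote over the 10-second excerpt
excerpt_prediction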

Step 5: Analysis in Pandas

Read the MFCC features from the first test audio excerpt into a data frame:

In [ ]:
df_brahms = pandas.DataFrame(mfcc_brahms_test_scaled)
In [ ]:
df_brahms.shape
In [ ]:
df_brahms.head()
In [ ]:
df_busta = pandas.DataFrame(mfcc_busta_test_scaled)

Compute the pairwise correlation between every pair of the 12 MFCCs for both test audio excerpts. For each audio excerpt, which pair of MFCCs is the most correlated? Which pair is the least correlated?

In [ ]:
df_brahms.corr()
In [ ]:
df_busta.corr()

Display a scatter plot of any two of the MFCC dimensions (i.e. columns of the data frame) against one another. Try for multiple pairs of MFCC dimensions.

In [ ]:
df_brahms.plot.scatter?
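
For example (a sketch plotting the first two MFCC dimensions against each other):

In [ ]:
df_brahms.plot.scatter(x=0, y=1)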

Do the same for the second audio excerpt: display a scatter plot of any two MFCC dimensions against one another, and try multiple pairs.

In [ ]:
df_busta.plot.scatter?

Plot a histogram of all values of a single MFCC, i.e. one MFCC coefficient number. Repeat for a few different coefficient numbers:

In [ ]:
df_brahms[0].plot.hist()
In [ ]:
df_busta[11].plot.hist()

Extra Credit

Create a new genre classifier by repeating the steps above, but this time use training data and test data from your own audio collection representing two or more different genres. For which genres and audio styles does the classifier work well, and for which (pairs of) genres does it fail?

Create a new genre classifier by repeating the steps above, but this time use a different machine learning classifier, e.g. random forest, Gaussian mixture model, naive Bayes, k-nearest neighbor, etc. Adjust the parameters. How well does each classifier perform?

Create a new genre classifier by repeating the steps above, but this time use different features. Consult the librosa documentation on feature extraction for different choices of features. Which features work well, and which do not?