In [1]:
%matplotlib inline
import numpy, scipy, matplotlib.pyplot as plt, IPython.display as ipd
import librosa, librosa.display
import stanford_mir; stanford_mir.init()

Short-Time Fourier Transform

Musical signals are highly non-stationary, i.e., their statistics change over time. It would be rather meaningless to compute a single Fourier transform over an entire 10-minute song.

The short-time Fourier transform (STFT) (Wikipedia; FMP, p. 53) is obtained by computing the Fourier transform for successive frames in a signal.

$$ X(m, \omega) = \sum_n x(n) w(n-m) e^{-j \omega n} $$

As we increase $m$, we slide the window function $w$ to the right. For the resulting frame, $x(n) w(n-m)$, we compute the Fourier transform. Therefore, the STFT $X$ is a function of both time, $m$, and frequency, $\omega$.
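To make the windowing concrete, here is a minimal NumPy sketch (not librosa's implementation) of a single STFT frame, assuming a Hann window and a frame length of 2048 samples:

import numpy, scipy.signal

def stft_frame(x, m, n_fft=2048):
    # Window the n_fft samples starting at sample m, then take the DFT:
    # this is the Fourier transform of x(n) w(n - m) for one value of m.
    w = scipy.signal.get_window('hann', n_fft)
    return numpy.fft.rfft(x[m:m + n_fft] * w)

# Example: one frame of a 440 Hz sine wave at offset m = 0.
sr_demo = 22050
t = numpy.arange(sr_demo) / sr_demo
frame_spectrum = stft_frame(numpy.sin(2 * numpy.pi * 440 * t), m=0)

# Sliding m forward by a fixed hop length and stacking the results
# column-wise yields the full STFT matrix X(m, omega).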

Let's load a file:

In [2]:
x, sr = librosa.load('audio/brahms_hungarian_dance_5.mp3')
ipd.Audio(x, rate=sr)
Out[2]:

librosa.stft computes an STFT. We provide it a frame size, i.e. the size of the FFT, and a hop length, i.e. the frame increment:

In [3]:
hop_length = 512
n_fft = 2048
X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)

To convert the hop length and frame size to units of seconds:

In [4]:
float(hop_length)/sr # units of seconds
Out[4]:
0.023219954648526078
In [5]:
float(n_fft)/sr  # units of seconds
Out[5]:
0.09287981859410431

For real-valued signals, the Fourier transform is conjugate-symmetric about the midpoint (the Nyquist frequency). Therefore, librosa.stft only retains one half of the output:

In [6]:
X.shape
Out[6]:
(1025, 9813)

This STFT has 1025 frequency bins and 9813 frames in time.
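Both dimensions follow from the parameters above: a real-input FFT keeps only the non-negative frequencies, n_fft/2 + 1 of them, and with librosa's default center=True padding there is one frame per hop_length samples, plus one. A quick sanity check, assuming those defaults:

print(X.shape[0] == n_fft // 2 + 1)            # 1025 frequency bins
print(X.shape[1] == 1 + len(x) // hop_length)  # 9813 frames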

Spectrogram

In music processing, we often only care about the spectral magnitude and not the phase content.

The spectrogram (Wikipedia; FMP, p. 29, 55) shows the intensity of frequencies over time. A spectrogram is simply the squared magnitude of the STFT:

$$ S(m, \omega) = \left| X(m, \omega) \right|^2 $$
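In code, this is just the elementwise squared magnitude of the STFT matrix computed above:

S_power = numpy.abs(X)**2   # spectrogram: squared magnitude of the STFT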

The human perception of sound intensity is logarithmic in nature. Therefore, we are often interested in the log amplitude:

In [7]:
S = librosa.amplitude_to_db(abs(X))  # 20*log10|X|, equivalent to 10*log10|X|^2, i.e. the spectrogram in dB

To display any type of spectrogram in librosa, use librosa.display.specshow.

In [8]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(S, sr=sr, hop_length=hop_length, x_axis='time', y_axis='linear')
plt.colorbar(format='%+2.0f dB')
Out[8]:
<matplotlib.colorbar.Colorbar at 0x1c31a3d2b0>

Mel-spectrogram

librosa has other useful spectral representations, including librosa.feature.melspectrogram:

In [9]:
hop_length = 256
S = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=4096, hop_length=hop_length)
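
Under the hood, the mel spectrogram is a mel filter bank applied to the power spectrogram of the STFT. A minimal sketch of that equivalence, assuming librosa's defaults (128 mel bands, power = 2):

# Build a mel filter bank and apply it to the power spectrogram of the STFT.
mel_basis = librosa.filters.mel(sr=sr, n_fft=4096, n_mels=128)
D = numpy.abs(librosa.stft(x, n_fft=4096, hop_length=hop_length))**2
S_manual = mel_basis.dot(D)   # should match S above up to numerical precision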

The human perception of sound intensity is logarithmic in nature. Therefore, like the STFT-based spectrogram, we are often interested in the log amplitude:

In [10]:
logS = librosa.power_to_db(S)  # melspectrogram returns a power spectrogram, so use power_to_db

Again, use librosa.display.specshow to display the mel spectrogram:

In [11]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logS, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
Out[11]:
<matplotlib.colorbar.Colorbar at 0x10cd6e898>

Using y_axis='mel' plots the y-axis on the mel scale, which is similar to the $\log (1 + f)$ function:

$$ m = 2595 \log_{10} \left(1 + \frac{f}{700} \right) $$
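As a quick check of the formula, we can compare it against librosa.hz_to_mel, which implements this same HTK mel formula when htk=True (its default is the slightly different Slaney variant):

f = numpy.array([110.0, 220.0, 440.0, 880.0])             # frequencies in Hz
m = 2595 * numpy.log10(1 + f/700)                         # mel values from the formula above
print(numpy.allclose(m, librosa.hz_to_mel(f, htk=True)))  # should print True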

librosa.cqt

Unlike the Fourier transform, but similar to the mel scale, the constant-Q transform uses a logarithmically spaced frequency axis.
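For example, the CQT bin center frequencies form a geometric sequence; with the default 12 bins per octave, adjacent bins are a semitone ($2^{1/12}$) apart:

freqs = librosa.cqt_frequencies(n_bins=72, fmin=librosa.midi_to_hz(36))
print(freqs[:4])               # first few bin center frequencies in Hz
print(freqs[1:]/freqs[:-1])    # constant ratio of 2**(1/12) between adjacent bins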

To plot a constant-Q spectrogram, we will use librosa.cqt:

In [12]:
fmin = librosa.midi_to_hz(36)                    # MIDI note 36 = C2
C = librosa.cqt(x, sr=sr, fmin=fmin, n_bins=72)  # 72 bins at 12 bins per octave = 6 octaves
logC = librosa.amplitude_to_db(abs(C))
In [14]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logC, sr=sr, x_axis='time', y_axis='cqt_note', fmin=fmin, cmap='coolwarm')
plt.colorbar(format='%+2.0f dB')
Out[14]:
<matplotlib.colorbar.Colorbar at 0x1c23e1f6a0>