Short-Time Fourier Transform

import IPython.display as ipd
import matplotlib.pyplot as plt
import librosa.display

from mirdotcom import mirdotcom

mirdotcom.init()

Musical signals are highly non-stationary, i.e., their statistics change over time. It would be rather meaningless to compute a single Fourier transform over an entire 10-minute song.

The short-time Fourier transform (STFT) (Wikipedia; FMP, p. 53) is obtained by computing the Fourier transform for successive frames in a signal.

\[ X(m, \omega) = \sum_n x(n) w(n-m) e^{-j \omega n} \]

As we increase \(m\), we slide the window function \(w\) to the right. For the resulting frame, \(x(n) w(n-m)\), we compute the Fourier transform. Therefore, the STFT \(X\) is a function of both time, \(m\), and frequency, \(\omega\).
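To make the definition concrete, here is a minimal NumPy sketch of the sliding-window computation. It is an illustration only, not how librosa implements its STFT: the function name `naive_stft`, the Hann window, and the absence of padding are all assumptions made for this example. The shift advances in steps of `hop_length` samples, and each frame's phase is referenced to the start of the frame rather than to absolute time, which changes the phase by a linear factor but leaves the magnitudes unchanged.

```python
import numpy as np


def naive_stft(x, n_fft=2048, hop_length=512):
    """Illustrative STFT: window each frame, then take its FFT.

    No padding or centering is applied, so the frame count differs
    from what librosa.stft returns by default.
    """
    window = np.hanning(n_fft)  # w(n): a Hann window (assumed choice)
    n_frames = 1 + (len(x) - n_fft) // hop_length
    X = np.empty((1 + n_fft // 2, n_frames), dtype=complex)
    for m in range(n_frames):
        start = m * hop_length  # window shift, in samples
        frame = x[start:start + n_fft] * window  # windowed frame
        X[:, m] = np.fft.rfft(frame)  # non-negative frequencies only
    return X


# Test signal: one second of a 440 Hz sine sampled at 22050 Hz.
sr_demo = 22050
t = np.arange(sr_demo) / sr_demo
tone = np.sin(2 * np.pi * 440 * t)

X_demo = naive_stft(tone)
peak_bin = int(np.argmax(np.abs(X_demo[:, 0])))  # strongest bin in frame 0
peak_hz = peak_bin * sr_demo / 2048  # bin index converted to hertz
```

The strongest bin in every frame sits at the bin nearest 440 Hz, which is what we expect from a stationary tone.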

Let’s load a file:

filename = mirdotcom.get_audio("brahms_hungarian_dance_5.mp3")
x, sr = librosa.load(filename)
ipd.Audio(x, rate=sr)

librosa.stft computes an STFT. We provide it a frame size (n_fft, the size of the FFT) and a hop length (hop_length, the number of samples between successive frames):

hop_length = 512
n_fft = 2048
X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)

To convert the hop length and frame size to units of seconds:

float(hop_length) / sr  # units of seconds
0.023219954648526078
float(n_fft) / sr  # units of seconds
0.09287981859410431
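The same arithmetic gives the time position of any frame and the overlap between neighboring frames. The sketch below is plain Python; the helper name `frame_to_seconds` is hypothetical, and the sampling rate of 22050 Hz is assumed because it is consistent with the values printed above.

```python
def frame_to_seconds(frame_index, hop_length=512, sr=22050):
    """Time offset, in seconds, of the frame with the given index."""
    return frame_index * hop_length / sr


hop_seconds = frame_to_seconds(1)       # spacing between successive frames
frame_seconds = 2048 / 22050            # duration of one 2048-sample frame
overlap_fraction = 1 - 512 / 2048       # share of each frame reused by the next
```

With these settings, each frame shares three quarters of its samples with the next one. librosa also offers librosa.frames_to_time for converting frame indices to seconds.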

For real-valued signals, the Fourier transform is conjugate-symmetric: the bins above the Nyquist frequency mirror those below it. Therefore, librosa.stft retains only the non-negative-frequency half of the output, i.e. 1 + n_fft/2 bins:

X.shape
(1025, 9813)

This STFT has 1025 frequency bins (1 + 2048/2) and 9813 frames in time.
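The symmetry and the resulting shape can be checked on a single frame with NumPy alone. This is a sketch on random data, not on the audio file above. The helper `expected_frames` is hypothetical and assumes the default centered padding of librosa.stft, under which the frame count is one plus the signal length divided by the hop length, rounded down.

```python
import numpy as np

n_fft = 2048
rng = np.random.default_rng(0)
frame = rng.standard_normal(n_fft)  # any real-valued frame works

full = np.fft.fft(frame)   # all n_fft bins
half = np.fft.rfft(frame)  # bins 0 .. n_fft/2 only

# For real input, bin k and bin n_fft - k are complex conjugates.
k = np.arange(1, n_fft // 2)
is_conjugate_symmetric = np.allclose(full[k], np.conj(full[n_fft - k]))

n_bins = 1 + n_fft // 2  # number of bins that carry unique information


def expected_frames(n_samples, hop_length=512):
    """Frame count, assuming the signal is padded so frames are centered."""
    return 1 + n_samples // hop_length
```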

Spectrogram

In music processing, we often only care about the spectral magnitude and not the phase content.

The spectrogram (Wikipedia; FMP, p. 29, 55) shows the intensity of frequencies over time. A spectrogram is the squared magnitude of the STFT:

\[ S(m, \omega) = \left| X(m, \omega) \right|^2 \]

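Human perception of sound intensity is roughly logarithmic. Therefore, we usually view the spectrogram in decibels. Passing the magnitude \(|X|\) to librosa.amplitude_to_db yields the same decibel values as passing the power \(|X|^2\) to librosa.power_to_db: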

S = librosa.amplitude_to_db(abs(X))
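The relationship between the amplitude and power conventions is easy to verify by hand. The sketch below uses made-up magnitudes and plain NumPy; it leaves out the clipping of very small values and the dynamic-range limit that librosa's conversion functions additionally apply.

```python
import numpy as np

magnitude = np.array([1.0, 0.5, 0.1, 0.01])  # example |X| values
power = magnitude ** 2                        # corresponding |X|^2 values

db_from_amplitude = 20 * np.log10(magnitude)  # amplitude convention
db_from_power = 10 * np.log10(power)          # power convention
```

Both arrays hold the same decibel values, because squaring inside the logarithm doubles the result and the factor drops from 20 to 10.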

To display any type of spectrogram in librosa, use librosa.display.specshow.

plt.figure(figsize=(15, 5))
librosa.display.specshow(
    S, sr=sr, hop_length=hop_length, x_axis="time", y_axis="linear"
)
plt.colorbar(format="%+2.0f dB")
[Figure: log-amplitude spectrogram, with time on the x-axis, linear frequency on the y-axis, and a colorbar in dB]

Mel-spectrogram

librosa provides several other spectral representations, including librosa.feature.melspectrogram, which maps the spectrogram onto the mel frequency scale:

hop_length = 256
S = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=4096, hop_length=hop_length)

As with the STFT-based spectrogram, we are usually interested in a logarithmic view. By default, librosa.feature.melspectrogram returns a power spectrogram, which is already non-negative, so we convert it with librosa.power_to_db:

logS = librosa.power_to_db(S)

Again, we display the result with librosa.display.specshow, this time with a mel-scaled y-axis:

plt.figure(figsize=(15, 5))
librosa.display.specshow(
    logS, sr=sr, hop_length=hop_length, x_axis="time", y_axis="mel"
)
plt.colorbar(format="%+2.0f dB")
[Figure: log-power mel-spectrogram, with time on the x-axis, mel-scaled frequency on the y-axis, and a colorbar in dB]

Using y_axis="mel" plots the y-axis on the mel scale, which grows like the \(\log (1 + f)\) function. One widely used formula for converting a frequency \(f\) in hertz to mels \(m\) is:

\[ m = 2595 \log_{10} \left(1 + \frac{f}{700} \right) \]
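The formula above and its inverse fit in a few lines of plain Python. The function names below are chosen for this sketch. Note that librosa's own mel conversion uses a different (Slaney-style) definition by default, so its values will not match this formula exactly unless the HTK variant is requested.

```python
import math


def hz_to_mel(f):
    """Convert hertz to mels using the formula shown above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Invert the formula: convert mels back to hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Equal steps in hertz shrink on the mel axis as frequency rises.
low_step = hz_to_mel(2000.0) - hz_to_mel(1000.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
```

The scale is constructed so that 1000 Hz maps to roughly 1000 mels, and the 1000 Hz step near 7 kHz spans far fewer mels than the same step near 1 kHz.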

Constant-Q Transform

Unlike the Fourier transform, but similar to the mel scale, the constant-Q transform uses a logarithmically spaced frequency axis.

To plot a constant-Q spectrogram, we will use librosa.cqt:

fmin = librosa.midi_to_hz(36)
C = librosa.cqt(x, sr=sr, fmin=fmin, n_bins=72)
logC = librosa.amplitude_to_db(abs(C))
plt.figure(figsize=(15, 5))
librosa.display.specshow(
    logC, sr=sr, x_axis="time", y_axis="cqt_note", fmin=fmin, cmap="coolwarm"
)
plt.colorbar(format="%+2.0f dB")
[Figure: log-amplitude constant-Q spectrogram, with time on the x-axis, note names on the y-axis, and a colorbar in dB]
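To see what the logarithmic spacing means for the settings used above, the following plain-Python sketch lists the bin center frequencies. It assumes 12 bins per octave (one per semitone), so the 72 bins starting at MIDI note 36 span six octaves. The local `midi_to_hz` is a stand-in written for this example using the standard 440 Hz tuning, and the Q value uses one common definition, center frequency divided by the spacing to the next bin.

```python
def midi_to_hz(note):
    """Frequency of a MIDI note number, with A4 (note 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


fmin = midi_to_hz(36)   # lowest bin: C2
n_bins = 72
bins_per_octave = 12    # assumed: one bin per semitone

# Each bin is a fixed ratio above the previous one.
center_freqs = [fmin * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

# Ratio of center frequency to bin spacing, identical for every bin.
q_factor = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
```

Because the ratio between neighboring bins is fixed, every octave receives the same number of bins, whereas the STFT's linearly spaced bins crowd into the upper octaves.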