MIR Spectrogram

Frequency-domain visualization for music information retrieval. Configure FFT parameters, window functions, frequency scales, and color maps per track.

Window Functions

Window functions shape the analysis frame before the FFT, controlling the trade-off between frequency resolution (main lobe width) and spectral leakage (sidelobe level).

  • Hann — The most common window for spectral analysis. Good balance of frequency resolution and sidelobe suppression (~31 dB down). Smooth cosine taper to zero at both edges.
  • Hamming — Similar to Hann but does not taper to zero, leaving a small discontinuity at the edges. First sidelobe is ~42 dB down, but sidelobes decay more slowly than Hann.
  • Blackman — Wider main lobe than Hann, but sidelobes are ~58 dB down. Good choice when detecting weak signals near strong ones.
  • Blackman-Harris — Four-term cosine sum with sidelobes ~92 dB down. Excellent dynamic range at the cost of the widest main lobe. Useful for high dynamic range measurements.
  • Bartlett — Triangular window with ~26 dB sidelobe suppression. Simpler than cosine windows, with sidelobes that decay faster than rectangular. Its main lobe is about as wide as Hann's, but its sidelobe suppression is weaker.
  • Rectangular — No windowing (uniform weighting). Narrowest main lobe giving the best frequency resolution, but only ~13 dB sidelobe suppression. Use when the signal is already periodic within the frame or for transient detection.
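
The sidelobe figures above can be checked numerically. The following is a minimal sketch, not this tool's implementation: it assumes the textbook symmetric window definitions and evaluates a brute-force DTFT using only the Python standard library. The function names (for example `peak_sidelobe_db`) are illustrative, Blackman-Harris is left out for brevity, and the exact values shift slightly with frame length and with the symmetric versus periodic window convention.

```python
import cmath
import math


# Textbook symmetric window definitions: n is the sample index, N the frame length.
def rectangular(n, N):
    return 1.0


def bartlett(n, N):
    return 1.0 - abs(2.0 * n / (N - 1) - 1.0)


def hann(n, N):
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1))


def hamming(n, N):
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))


def blackman(n, N):
    x = 2 * math.pi * n / (N - 1)
    return 0.42 - 0.5 * math.cos(x) + 0.08 * math.cos(2 * x)


def peak_sidelobe_db(window, N=64, oversample=32):
    """Highest sidelobe of the window's spectrum, in dB relative to the main-lobe peak."""
    w = [window(n, N) for n in range(N)]
    grid = N * oversample        # number of points on the full frequency circle
    num_points = grid // 2       # evaluate from DC up to just below Nyquist
    mags = []
    for k in range(num_points):
        omega = 2 * math.pi * k / grid
        total = sum(w[n] * cmath.exp(-1j * omega * n) for n in range(N))
        mags.append(abs(total))
    # Walk down the main lobe until the magnitude starts rising again (first null).
    k = 1
    while k < num_points and mags[k] < mags[k - 1]:
        k += 1
    # Everything past the first null is sidelobe territory.
    return 20 * math.log10(max(mags[k:]) / mags[0])


if __name__ == "__main__":
    for window in (rectangular, bartlett, hann, hamming, blackman):
        print(f"{window.__name__:12s} {peak_sidelobe_db(window):6.1f} dB")
```

In practice a numerical library would replace the explicit DTFT loop; the brute-force version is used here only to keep the sketch dependency-free.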

Frequency Scales

The frequency axis can be mapped using different perceptual and mathematical scales, each suited to different analysis tasks.

  • Linear — Uniform spacing in Hz. Best for narrowband analysis, harmonic series inspection, and engineering measurements where equal Hz spacing is needed.
  • Logarithmic — Equal frequency ratios occupy equal distance (octave-based), so each octave takes up the same visual space. Matches musical pitch perception. Standard for wideband audio analysis.
  • Mel — Perceptual scale based on pitch perception experiments. Approximately linear below 1 kHz and logarithmic above. Widely used in speech recognition, speaker identification, and MIR feature extraction (MFCCs).
  • Bark — Based on critical bands of the auditory system (24 Bark bands). Each band corresponds to a region of the basilar membrane. Used in psychoacoustic models, audio codec design, and perceptual loudness measurement.
  • ERB (Equivalent Rectangular Bandwidth) — Models auditory filter bandwidths more accurately than Bark at low frequencies. Common in computational auditory scene analysis (CASA) and hearing research.
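
To make the scales above concrete, here is a hedged sketch of commonly published Hz-to-scale conversions. The page does not say which formulas this tool uses, so these are assumptions: the HTK-style mel formula, Traunmüller's Bark approximation, and the Glasberg and Moore ERB-rate formula. The helper `axis_position` is a hypothetical name for mapping a frequency to a normalized position on the display axis.

```python
import math


def hz_to_mel(f):
    """HTK-style mel formula (one of several variants in use)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def hz_to_bark(f):
    """Traunmüller's approximation of the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53


def hz_to_erb_rate(f):
    """Glasberg and Moore ERB-rate scale (number of ERBs below f)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f)


def axis_position(f, f_min, f_max, scale):
    """Position of f in [0, 1] on an axis spanning f_min..f_max under `scale`."""
    lo, hi = scale(f_min), scale(f_max)
    return (scale(f) - lo) / (hi - lo)


if __name__ == "__main__":
    for name, scale in [("linear", float), ("log", math.log2),
                        ("mel", hz_to_mel), ("bark", hz_to_bark),
                        ("erb", hz_to_erb_rate)]:
        pos = axis_position(1000.0, 20.0, 20000.0, scale)
        print(f"{name:7s} 1 kHz sits at {pos:.2f} of the axis height")
```

Passing `math.log2` as the scale gives the logarithmic axis, on which each octave spans the same distance; passing `float` gives the linear axis.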

Color Maps

All color maps are perceptually uniform (equal steps in data value produce equal perceived brightness changes), so the color map itself does not introduce false boundaries or bands into the spectrogram.

  • Viridis — Blue-green-yellow. The default choice for scientific visualization. Perceptually uniform, colorblind-friendly, and prints well in grayscale.
  • Magma — Black-purple-orange-yellow. High contrast at both ends of the range. Good for highlighting low-energy detail against a dark background.
  • Inferno — Black-purple-red-yellow. Similar to magma with warmer midtones. Useful when distinguishing mid-range energy levels.
  • Grayscale — Black to white. Familiar and intuitive, works universally in print and on screen.
  • Inverted Grayscale — White to black. Dark regions represent high energy, matching traditional paper-based spectrogram plots.
  • Roseus — Black-pink-white. Perceptually uniform alternative to the common "hot" colormap without the misleading luminance reversals.
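
As a small illustration of the two grayscale options, the sketch below maps a magnitude that has already been normalized to [0, 1] onto an 8-bit gray level. It is an assumption about how such a mapping could look, not this tool's code; `grayscale_level` is a hypothetical name, and the multi-color maps are omitted because they rely on published lookup tables.

```python
def grayscale_level(value, inverted=False):
    """Map a normalized value in [0, 1] to an 8-bit gray level (0 = black, 255 = white)."""
    clamped = min(max(value, 0.0), 1.0)   # out-of-range input is clipped
    level = round(clamped * 255)
    # Inverted grayscale draws high energy as dark, like ink on paper.
    return 255 - level if inverted else level
```

With `inverted=True`, a full-scale value of 1.0 maps to 0 (black) and silence maps to 255 (white).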

FFT Parameters

The Short-Time Fourier Transform (STFT) parameters control the resolution and appearance of the spectrogram.

  • FFT Size — Number of samples per analysis frame (256–16384). Larger sizes give finer frequency resolution but coarser time resolution (the time-frequency uncertainty principle). 2048 or 4096 are common starting points for music analysis at 44.1 kHz.
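  • Hop Size — Number of samples between successive frames. Smaller hops increase the overlap between frames and sample the time axis more densely, giving a smoother image at the cost of more frames to compute; they do not shorten the analysis frame itself. Typical values are 1/4 to 1/2 of the FFT size.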
  • Zero Padding — Appending zeros to the frame before the FFT interpolates the frequency bins, producing a visually smoother spectrogram without changing the actual frequency resolution.
  • dB Range & Gain — The dynamic range controls which magnitudes are visible. A typical range of 80–120 dB shows most musical content. Gain offsets the entire scale, useful for normalizing quiet recordings.
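
The arithmetic behind these parameters can be sketched as follows. This is a rough illustration under stated assumptions rather than the tool's implementation: `zero_pad_factor` is a hypothetical parameter (padded transform length divided by frame length), levels are assumed to be in dB relative to full scale with 0 dB at the top of the display, and gain is assumed to be a simple additive offset in dB.

```python
def stft_summary(sample_rate, fft_size, hop_size, zero_pad_factor=1):
    """Derived quantities for an STFT configuration."""
    return {
        # True frequency resolution is set by the frame length alone.
        "bin_spacing_hz": sample_rate / fft_size,
        # Zero padding only interpolates: displayed bins sit closer together.
        "display_bin_spacing_hz": sample_rate / (fft_size * zero_pad_factor),
        "frame_ms": 1000.0 * fft_size / sample_rate,
        "hop_ms": 1000.0 * hop_size / sample_rate,
        "overlap": 1.0 - hop_size / fft_size,
    }


def frame_count(num_samples, fft_size, hop_size):
    """Number of complete frames, assuming no padding at the signal edges."""
    if num_samples < fft_size:
        return 0
    return 1 + (num_samples - fft_size) // hop_size


def db_to_unit(level_db, db_range=80.0, gain_db=0.0):
    """Map a level in dB (0 dB = top of scale) to [0, 1] for display."""
    shifted = level_db + gain_db              # gain moves the whole scale
    return min(max(1.0 + shifted / db_range, 0.0), 1.0)


if __name__ == "__main__":
    summary = stft_summary(44100, 2048, 512)
    for key, value in summary.items():
        print(f"{key:24s} {value:.3f}")
    print("frames in one second:", frame_count(44100, 2048, 512))
```

For example, a 2048-sample frame at 44.1 kHz gives bins roughly 21.5 Hz apart and spans about 46 ms, which is why larger FFT sizes resolve pitch more finely but smear fast transients.
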
Audio Credits: "Whiptails" by Albert Kader — Minimal Techno stems from the Cambridge Music Technology multitrack library. Licensed under CC BY 4.0.