

5.2 Chroma-Based Audio Features

In our segmentation procedure, we assume that we are given a transcription of a reference tune in the form of a MIDI file. Recall from Section 5.1 that this is exactly the situation we have with the songs of the OGL collection. In the first step, we transform the MIDI reference as well as the audio recording into a common mid-level representation. Here, we use the well-known chroma representation, as described in this section.




Figure 5.1: Representations of the beginning of the first stanza of the folk song OGL27517. (a) Score representation of the manually generated reference transcription. (b) Chromagram of the MIDI representation of the transcription. (c) Smoothed MIDI chromagram (CENS feature). (d) Chromagram of an audio recording (CENS feature). (e) F0-enhanced chromagram as will be introduced as the first enhancement strategy in Section 5.5.


Chroma features have turned out to be a powerful mid-level representation for relating harmony-based music, see [4; 6; 90; 123; 143; 162; 164]. Assuming the equal-tempered scale, the term chroma refers to the elements of the set {C, C♯, D, ..., B} consisting of the twelve pitch spelling attributes as used in Western music notation. Note that in the equal-tempered scale, different pitch spellings such as C♯ and D♭ refer to the same chroma. A chroma vector can be represented as a 12-dimensional vector x = (x(1), x(2), ..., x(12))^T, where x(1) corresponds to chroma C, x(2) to chroma C♯, and so on. Representing the short-time energy content of the signal in each of the twelve pitch classes, chroma features not only account for the close octave relationship in both melody and harmony as it is prominent in Western music, but also introduce a high degree of robustness to variations in timbre and articulation [4]. Furthermore, normalizing the features makes them invariant to dynamic variations.
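To make this mapping concrete, the following Python sketch (purely illustrative; the function and variable names are chosen for this example and are not taken from any particular library) shows how MIDI pitch numbers fold onto the twelve chroma classes by discarding octave information.

# Illustrative sketch: mapping MIDI pitches to chroma indices (0 = C, 1 = C#, ..., 11 = B).
CHROMA_LABELS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def midi_pitch_to_chroma(p):
    """Map a MIDI pitch number (e.g., 60 = C4, 69 = A4) to its chroma index."""
    return p % 12  # octave information is discarded, only the pitch class remains

# C4 (60), C5 (72), and C6 (84) all map to chroma 'C'; A4 (69) maps to 'A'.
for p in [60, 69, 72, 84]:
    print(p, CHROMA_LABELS[midi_pitch_to_chroma(p)])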

It is straightforward to transform a MIDI representation into a chroma representation or chromagram. Using the explicit MIDI pitch and timing information, one basically identifies pitches that belong to the same chroma class within a sliding window of a fixed size, see [90].


Figure 5.2: Magnitude responses in dB for some of the pitch filters of the multirate pitch filter bank used for the chroma computation. Top: Filters corresponding to MIDI pitches p ∈ [69 : 93] (with respect to the sampling rate 4410 Hz). Bottom: Filters shifted half a semitone upwards.

Disregarding information on dynamics, we derive a binary chromagram assuming only the values 0 and 1.⁴ Furthermore, dealing with monophonic tunes, one has for each frame at most one non-zero chroma entry that is equal to 1. Figure 5.1 shows various representations for the folk song OGL27517. Figure 5.1b shows a chromagram of a MIDI reference corresponding to the score shown in Figure 5.1a. In the following, the chromagram of the reference transcription is referred to as reference chromagram or MIDI chromagram.
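As a rough illustration of this construction, the following Python sketch derives such a binary chromagram from a note list; the (onset, duration, pitch) input format and the function name are assumptions made for this example and do not reflect a specific implementation.

import numpy as np

def midi_to_binary_chromagram(notes, feature_rate=10.0, total_duration=None):
    """Sketch: binary chromagram of a monophonic tune.

    notes: list of (onset_sec, duration_sec, midi_pitch) triples (assumed format).
    Returns a 12 x N matrix; for non-overlapping monophonic notes, each frame
    has at most one non-zero entry, which is equal to 1.
    """
    if total_duration is None:
        total_duration = max(onset + dur for onset, dur, _ in notes)
    num_frames = int(np.ceil(total_duration * feature_rate))
    chromagram = np.zeros((12, num_frames))
    for onset, dur, pitch in notes:
        start = int(round(onset * feature_rate))
        end = int(round((onset + dur) * feature_rate))
        chromagram[pitch % 12, start:end] = 1  # dynamics are disregarded
    return chromagram

# Example with three hypothetical notes (G4, A4, B4):
notes = [(0.0, 0.5, 67), (0.5, 0.5, 69), (1.0, 1.0, 71)]
C = midi_to_binary_chromagram(notes)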

For transforming an audio recording into a chromagram, one has to resort to signal processing techniques. There are many ways of computing and enhancing chroma features, which results in a large number of chroma variants with different properties [4; 58; 59; 123]. Most chroma implementations are based on short-time Fourier transforms in combination with binning strategies [4; 59]. We use chroma features obtained from a pitch decomposition using a multirate filter bank as described in [123]. A given audio signal is first decomposed into 88 frequency bands with center frequencies f_p corresponding to the pitches A0 to C8 (MIDI pitches p = 21 to p = 108), where

f_p = 2^{(p−69)/12} · 440 Hz.   (5.1)
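As a quick sanity check of Equation (5.1), the following Python snippet computes the center frequencies of the lowest, reference, and highest pitches of the decomposition (the function name is chosen for this example only).

def center_frequency(p):
    """Center frequency in Hz of MIDI pitch p according to Eq. (5.1)."""
    return 2 ** ((p - 69) / 12) * 440.0

# A0 (p = 21) -> 27.5 Hz, A4 (p = 69) -> 440 Hz, C8 (p = 108) -> ~4186 Hz
for p in (21, 69, 108):
    print(p, round(center_frequency(p), 1))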

Then, for each subband, we compute the short-time mean-square power (i. e., the samples of each subband output are squared) using a rectangular window of a fixed length and an overlap of 50 %. In the following, we use a window length of 200 milliseconds leading to a feature rate of 10 Hz (10 features per second). The resulting features measure the local energy content of each pitch subband and indicate the presence of certain musical notes within the audio signal, see [123] for further details.
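The subband energy computation can be sketched as follows; the code assumes that a single subband signal and its sampling rate are already available (e.g., as output of the filter bank) and is not meant to reproduce the exact implementation of [123].

import numpy as np

def short_time_power(subband, fs, win_len_sec=0.2, overlap=0.5):
    """Short-time mean-square power of one subband signal using a
    rectangular window; 200 ms windows with 50 % overlap yield a
    feature rate of 10 Hz."""
    subband = np.asarray(subband, dtype=float)
    win = int(round(win_len_sec * fs))
    hop = int(round(win * (1.0 - overlap)))
    powers = []
    for start in range(0, len(subband) - win + 1, hop):
        frame = subband[start:start + win]
        powers.append(np.mean(frame ** 2))  # mean-square power of the frame
    return np.array(powers)

# Example: one second of a test tone in a single subband sampled at 4410 Hz
fs = 4410
t = np.arange(fs) / fs
values = short_time_power(np.sin(2 * np.pi * 440 * t), fs)  # roughly ten values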

The employed pitch filters possess a relatively wide passband, while still properly separating adjacent notes thanks to sharp cutoffs in the transition bands, see Figure 5.2.

⁴Information about note intensities is not captured by the reference transcriptions.

Actually, the pitch filters are robust to deviations of up to ±25 cents⁵ from the respective note's center frequency. The pitch filters will play an important role in the following sections. We then obtain a chroma representation by simply adding up the corresponding values that belong to the same chroma. To achieve invariance with respect to dynamics, chroma vectors are normalized with respect to the Euclidean norm (signal energy). The resulting chroma features are further processed by applying suitable quantization, smoothing, and downsampling operations, resulting in enhanced chroma features referred to as CENS (Chroma Energy Normalized Statistics) features. An implementation of these features is available online⁶ and described in [127]. Adding a further degree of abstraction by considering short-time statistics over energy distributions within the chroma bands, CENS features constitute a family of scalable and robust audio features that have turned out to be very useful in audio matching and retrieval applications [135; 105]. These features allow for introducing a temporal smoothing. To this end, feature vectors are averaged using a sliding window technique depending on a window size denoted by w (given in frames) and a downsampling factor denoted by d, see [123] for details. In our experiments, we average feature vectors over a window corresponding to one second of audio while working with a feature resolution of 10 Hz (10 features per second). Figure 5.1c shows the resulting smoothed version of the reference (MIDI) chromagram shown in Figure 5.1b.

Figure 5.1d shows the final smoothed chromagram (CENS) for one of the five stanzas of the audio recording. For technical details, we refer to the cited literature.
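To summarize the chroma folding, normalization, and smoothing steps described above, the following Python sketch outlines the processing chain; it is a simplified illustration under the stated window parameters and omits the quantization step of the actual CENS features, so it does not replicate the referenced implementation.

import numpy as np

def pitch_to_chroma(pitch_features, midi_start=21):
    """Fold an 88 x N pitch-energy matrix (MIDI pitches 21..108) into a
    12 x N chroma matrix by summing all pitches of the same chroma class."""
    chroma = np.zeros((12, pitch_features.shape[1]))
    for i in range(pitch_features.shape[0]):
        chroma[(midi_start + i) % 12, :] += pitch_features[i, :]
    return chroma

def normalize_l2(chroma, eps=1e-9):
    """Normalize each chroma vector with respect to the Euclidean norm."""
    norms = np.linalg.norm(chroma, axis=0)
    return chroma / np.maximum(norms, eps)

def smooth_downsample(chroma, w=10, d=1):
    """Average each chroma band over a sliding window of w frames and keep
    every d-th frame; at a 10 Hz feature rate, w = 10 corresponds to the
    one-second smoothing used above."""
    kernel = np.ones(w) / w
    smoothed = np.array([np.convolve(row, kernel, mode='same') for row in chroma])
    return normalize_l2(smoothed[:, ::d])

# Placeholder input standing in for the filter-bank output (88 pitches, 100 frames)
pitch_energies = np.abs(np.random.randn(88, 100))
chroma_smoothed = smooth_downsample(normalize_l2(pitch_to_chroma(pitch_energies)))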