
Part I Beat Tracking and Tempo Estimation 9

2.3 Novelty Curve

Our PLP concept is based on a novelty curve as typically used for note onset detection tasks. We now describe the approach for computing novelty curves used in our experiments. In our variant, we combine ideas and fundamental concepts of various state-of-the-art methods [7; 100; 102; 189]. Our novelty curve is specifically designed to reveal meaningful note onset information even for complex music, such as orchestral pieces dominated by string instruments. Note, however, that the particular design of the novelty curve is not the focus of this thesis. The mid-level representations introduced in the following are designed to work even for noisy novelty curves with a poor peak structure. Naturally, the overall result may be improved by employing more refined novelty curves as suggested in [88; 189; 50].

Figure 2.2: Illustration of the estimation of optimal periodicity kernels. (a) Novelty curve ∆. (b) Magnitude tempogram |T| with maxima (indicated by circles) shown at five time positions t. (c) Optimal sinusoidal kernels κ_t (using a kernel size of 3 seconds) corresponding to the maxima. Note how the kernels capture the local peak structure of the novelty curve in terms of frequency and phase.

Figure 2.3: Illustration of the PLP computation from the optimal periodicity kernels shown in Figure 2.2c. (a) Novelty curve ∆. (b) Accumulation of all kernels (overlap-add). (c) PLP curve Γ obtained after half-wave rectification.

Recall from Section 2.1 that a note onset typically goes along with a sudden change of the signal’s energy and spectral content. In order to extract such changes, given a music recording, a short-time Fourier transform is used to obtain a spectrogram X = (X(k, t))_{k,t} with k ∈ [1 : K] and t ∈ [1 : T]. Here, K denotes the number of Fourier coefficients, T denotes the number of frames, and X(k, t) denotes the kth Fourier coefficient for time frame t. In our implementation, the discrete Fourier transforms are calculated over Hann-windowed frames of length 46 ms with 50% overlap. Consequently, each time parameter t corresponds to 23 ms of the audio recording.
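The framing described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the implementation used in the experiments: the function name `stft_magnitude`, the sampling rate used in the note below, and the decision to discard an incomplete frame at the end of the signal are assumptions made for this example.

```python
import numpy as np

def stft_magnitude(x, sr, frame_sec=0.046, overlap=0.5):
    """Return (|X|, hop_sec), where |X| has shape (K, T).

    Hypothetical helper: Hann-windowed frames of `frame_sec` seconds,
    `overlap` fraction shared between neighbouring frames.
    """
    x = np.asarray(x, dtype=float)
    n = int(round(frame_sec * sr))           # frame length in samples
    hop = int(round(n * (1.0 - overlap)))    # 50% overlap -> hop of n/2 samples
    num_frames = 1 + (len(x) - n) // hop     # full frames only (assumption)
    window = np.hanning(n)
    frames = np.stack(
        [x[i * hop : i * hop + n] * window for i in range(num_frames)],
        axis=1,
    )                                        # shape (n, T)
    magnitude = np.abs(np.fft.rfft(frames, axis=0))  # shape (K, T), linear frequency axis
    return magnitude, hop / sr
```

Assuming a sampling rate of 22050 Hz, a 46 ms frame comprises 1014 samples and the hop is 507 samples, so consecutive frame indices t are about 23 ms apart, in line with the text.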

Note that the Fourier coefficients of X are linearly spaced on the frequency axis. Using suitable binning strategies, various approaches switch over to a logarithmically spaced frequency axis, e.g., by using mel-frequency bands or pitch bands, see [100]. Here, we keep the linear frequency axis, since it puts greater emphasis on the high-frequency regions of the signal, thus accentuating noise bursts that are typically visible in the high-frequency spectrum. Similar strategies for accentuating the high-frequency content for onset detection are proposed in [118; 23].

Figure 2.4: First 12 measures of Beethoven’s Symphony No. 5 (Op. 67). (a) Score representation (in a piano-reduced version). (b) Annotated reference onsets (for an orchestral audio recording conducted by Bernstein). (c) Magnitude spectrogram |X|. (d) Logarithmically compressed magnitude spectrogram Y. (e) Novelty curve $\bar{\Delta}$ and local mean (red curve). (f) Novelty curve ∆.

In the next step, we apply a logarithm to the magnitude spectrogram |X| of the signal, yielding

$$Y := \log(1 + C \cdot |X|)$$

for a suitable constant C > 1, see [100; 102]. Such a compression step not only accounts for the logarithmic sensation of sound intensity but also allows for adjusting the dynamic range of the signal to enhance the clarity of weaker transients, especially in the high-frequency regions. In our experiments, we use the value C = 1000, but our results as well as the findings reported by Klapuri et al. [102] show that the specific choice of C does not affect the final result in a substantial way. The effect of this compression step is illustrated by Figure 2.4 for a recording of Beethoven’s Fifth Symphony. Figure 2.4a shows the piano-reduced version of the first 12 measures of the score. The audio recording is an orchestral version conducted by Bernstein. Figure 2.4c shows the magnitude spectrogram |X| and Figure 2.4d the compressed spectrogram Y using C = 1000. As a result of the logarithmic compression, events with low intensities are considerably enhanced in Y, especially in the high-frequency range.
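To make the effect of the compression concrete, the following sketch applies Y = log(1 + C · |X|) to a few magnitudes. The function name `log_compress` and the example magnitudes are illustrative assumptions; only the formula and the value C = 1000 come from the text.

```python
import numpy as np

def log_compress(magnitude, C=1000.0):
    """Logarithmic compression Y = log(1 + C * |X|), applied element-wise."""
    # np.log1p(z) computes log(1 + z) and stays accurate for small z.
    return np.log1p(C * np.asarray(magnitude, dtype=float))

# Hypothetical magnitudes: silence, a weak transient, a strong onset.
example = np.array([0.0, 0.001, 1.0])
compressed = log_compress(example)
```

Before compression the strong onset is 1000 times larger than the weak transient; after compression the ratio is below 10, which is why low-intensity events become visible in Y.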

Figure 2.5: Illustrating the effect of the logarithmic compression on the resulting novelty curves. (a) Novelty curve based on the magnitude spectrogram |X| (see Figure 2.4c). (b) Manually annotated reference onsets. (c) Novelty curve ∆ based on the logarithmically compressed magnitude spectrogram Y (see Figure 2.4d).

To obtain a novelty curve, we basically apply a first-order differentiator to compute the discrete temporal derivative of the compressed spectrum Y. In the following, we only consider note onsets (positive derivative) and not note offsets (negative derivative). Therefore, we sum up only over positive intensity changes to obtain the novelty function $\bar{\Delta} : [1 : T-1] \to \mathbb{R}$:

$$\bar{\Delta}(t) := \sum_{k=1}^{K} |Y(k, t+1) - Y(k, t)|_{\geq 0} \qquad (2.1)$$

for t ∈ [1 : T − 1], where |x|_{≥0} := x for a non-negative real number x and |x|_{≥0} := 0 for a negative real number x. Figure 2.4e shows the resulting curve for the Beethoven example. To obtain our final novelty function ∆, we subtract the local mean (red curve in Figure 2.4e) from $\bar{\Delta}$ and only keep the positive part (half-wave rectification), see Figure 2.4f. In our implementation, we actually use a higher-order smoothed differentiator [2].
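Equation (2.1) and the subsequent local-mean subtraction can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the plain first-order difference rather than the higher-order smoothed differentiator mentioned above, and the moving-average length `mean_len` is a hypothetical parameter, since the text does not specify the window used for the local mean.

```python
import numpy as np

def novelty_curve(Y, mean_len=43):
    """Return (raw, final) novelty curves of length T-1 for Y of shape (K, T).

    raw   : half-wave rectified spectral difference, Eq. (2.1)
    final : raw minus its local mean, half-wave rectified again
    """
    Y = np.asarray(Y, dtype=float)
    diff = np.diff(Y, axis=1)                    # Y(k, t+1) - Y(k, t)
    raw = np.maximum(diff, 0.0).sum(axis=0)      # keep only intensity increases

    # Moving average as local mean (assumed window; zero-padded at the edges).
    kernel = np.ones(min(mean_len, raw.size))
    kernel /= kernel.size
    local_mean = np.convolve(raw, kernel, mode="same")

    final = np.maximum(raw - local_mean, 0.0)    # half-wave rectification
    return raw, final
```

With a frame period of about 23 ms, the default of 43 frames corresponds to roughly one second; this value is chosen only for the example.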

Furthermore, we process the spectrum in a bandwise fashion using five bands. Similarly to [154], these bands are logarithmically spaced and non-overlapping. Each band is roughly one octave wide. The lowest band covers the frequencies from 0 Hz to 500 Hz, the highest band from 4000 Hz to 11025 Hz. The resulting five novelty curves are summed up to yield the final novelty function.
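The bandwise processing might look as follows. Several details are assumptions made for this sketch: the text gives only the outermost band edges, so the intermediate edges (1000 Hz and 2000 Hz) are inferred from the statement that each band is roughly one octave wide; the local mean is subtracted per band before summation; the frequency grid assumes an even frame length; and `band_novelty` and `mean_len` are hypothetical names.

```python
import numpy as np

# Assumed band edges in Hz; only 0-500 Hz and 4000-11025 Hz are stated in the text.
BAND_EDGES_HZ = [0, 500, 1000, 2000, 4000, 11025]

def band_novelty(Y, sr, mean_len=43):
    """Sum of per-band novelty curves for a compressed spectrogram Y of shape (K, T)."""
    Y = np.asarray(Y, dtype=float)
    K, T = Y.shape
    freqs = np.linspace(0.0, sr / 2.0, K)                   # bin frequencies, linear axis
    band_of_bin = np.digitize(freqs, BAND_EDGES_HZ[1:-1])   # band index 0..4 per bin

    kernel = np.ones(min(mean_len, T - 1))
    kernel /= kernel.size
    total = np.zeros(T - 1)
    for b in range(len(BAND_EDGES_HZ) - 1):
        Yb = Y[band_of_bin == b]                            # rows belonging to band b
        raw = np.maximum(np.diff(Yb, axis=1), 0.0).sum(axis=0)
        local_mean = np.convolve(raw, kernel, mode="same")
        total += np.maximum(raw - local_mean, 0.0)          # rectified band novelty
    return total
```

Subtracting the local mean within each band, rather than once on the summed curve, prevents a loud band from masking a weak onset that occurs in another band; whether the original implementation orders the steps this way is not stated in the text.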

The resulting novelty curve for our Beethoven example reveals the note onset candidates in the form of impulse-like spikes. Actually, this piece constitutes a great challenge for onset detection: besides very dominant note onsets in the fortissimo section at the beginning of the piece (measures 1-5), there are soft and blurred note onsets in the piano section, which is mainly played by strings (measures 6-12). This is also reflected by the novelty curve shown in Figure 2.4f. The strong onsets in the fortissimo section result in very pronounced peaks. The soft onsets in the piano section (seconds 8-13), however, are much more difficult to distinguish from the spurious peaks not related to any note onsets.

In this context, the logarithmic compression plays a major role. Figure 2.5 compares the novelty curve ∆ with a novelty curve directly derived from the magnitude spectrogram |X| without applying a logarithmic compression. Actually, omitting the logarithmic compression (Figure 2.5a) results in a very noisy novelty curve that does not reveal musically meaningful onset information in the piano section. The novelty curve ∆ (Figure 2.5c), however, still possesses a regular peak structure in the problematic sections. This clearly illustrates the benefits of the compression step. Note that the logarithmic compression of the spectrogram gives higher weight to an absolute intensity difference within a quiet region of the signal than within a louder region, which follows the psychoacoustic principle that a just-noticeable change in intensity is roughly proportional to the absolute intensity [51]. Furthermore, the compression leads to a better temporal localization of the onset, because the highest relative slope of the attack phase approaches the actual onset position, and it noticeably reduces the influence of amplitude changes (e.g., tremolo) in high-intensity regions. Further examples of our novelty curve are discussed in Section 2.7.
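The weighting argument can be made explicit with a short first-order calculation. This is a sketch added for illustration, where δ denotes a small absolute change of the magnitude |X| (a symbol introduced here, not in the original text):

```latex
\frac{\partial Y}{\partial |X|} = \frac{C}{1 + C\,|X|},
\qquad\text{so}\qquad
\delta Y \;\approx\; \frac{C\,\delta}{1 + C\,|X|}
\;\approx\; \frac{\delta}{|X|}
\quad\text{for } C\,|X| \gg 1 .
```

The same absolute change δ therefore produces a change in Y that is approximately proportional to the relative change δ/|X|. For example, with C = 1000 and δ = 0.01, the compressed value increases by log(21/11) ≈ 0.65 at |X| = 0.01, but only by log(1011/1001) ≈ 0.01 at |X| = 1.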

The variant of a novelty curve described in this section combines important design principles and ideas of various approaches proposed in the literature. The basic idea of considering temporal differences of a spectrogram representation is well known from the spectral flux novelty curve, see [7]. This strategy works particularly well for percussive note onsets but is not suitable for less pronounced onsets (see Figure 2.5a). One well known variant of the spectral flux strategy is the complex domain method as proposed in [8]. Here, magnitude and phase information is combined in a single novelty curve to emphasize weak note onsets and smooth note transitions. In our experiments, the logarithmic compression has a similar effect as jointly considering magnitude and phase, but showed more robust results in many examples. Another advantage of our approach is that the compression constant C allows for adjusting the compression. The combination of magnitude compression and phase information did not lead to a further increase in robustness.