
7.3 A User Interface for Folk Song Navigation

Chroma templates capture performance aspects and variations across the various stanzas of a folk song; in particular, they give a visual impression of where such variations occur. We now present a user interface that allows such variations to be analyzed by listening to and, above all, comparing the different stanzas of an audio recording in a convenient way.

Once the audio recording has been segmented into stanzas and alignment paths between the MIDI reference and all audio stanzas have been computed, one can derive the temporal correspondences between the MIDI and the audio representation with the objective of associating the note events given by the MIDI file with their physical occurrences in the audio recording; see [123] for details. The result can be regarded as an automated annotation of the entire audio recording with the available MIDI events. Such annotations facilitate multimodal browsing and retrieval of MIDI and audio data, thus opening new ways of experiencing and researching music. For example, most successful algorithms for melody-based retrieval work in the domain of symbolic or MIDI music. On the other hand, retrieval results may be most naturally presented by playing back the original recording of the melody, while a musical score or a piano-roll representation may be the most appropriate form for visually displaying the query results. For a description of such functionalities, we refer to [26].
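The following sketch illustrates this kind of MIDI-to-audio annotation under simplified assumptions: the alignment path is given as a list of (MIDI frame, audio frame) index pairs, both feature sequences share the same frame rate, and all function and variable names are hypothetical rather than taken from [123].

```python
import bisect

def annotate_audio_with_midi(midi_notes, path, frame_rate=10.0):
    """Map MIDI note onsets to physical time positions in one audio stanza.

    midi_notes : list of (onset_seconds, pitch) pairs from the MIDI reference
    path       : alignment path as a list of (midi_frame, audio_frame) index
                 pairs, e.g. obtained from DTW between chroma features
    frame_rate : feature frames per second (assumed identical for both streams)
    """
    # Sort the path by its MIDI frame index so it can be binary-searched.
    path = sorted(path)
    midi_frames = [p[0] for p in path]

    annotations = []
    for onset, pitch in midi_notes:
        # Frame index of the note onset on the MIDI time axis.
        midi_frame = int(round(onset * frame_rate))
        # Take the first path entry at or after this MIDI frame.
        i = min(bisect.bisect_left(midi_frames, midi_frame), len(path) - 1)
        audio_frame = path[i][1]
        # Convert the corresponding audio frame back to seconds.
        annotations.append((audio_frame / frame_rate, pitch))
    return annotations
```

Applied to every stanza of a recording, such a mapping yields the automated annotation of the entire recording described above.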

Furthermore, aligning each stanza of the audio recording to the MIDI reference yields a multi-alignment between all stanzas. Exploiting this alignment, one can implement interfaces that allow a user to seamlessly switch between the various stanzas of the recording, thus facilitating direct access to and comparison of the audio material [123]. The Audio Switcher [57] constitutes such a user interface, which allows the user to open in parallel a synthesized version of the MIDI reference as well as all stanzas of the folk song recording, see Figure 7.9. Each of the stanzas is represented by a slider bar indicating the current playback position with respect to the stanza's particular time scale. The stanza that is currently used for audio playback, in the following referred to as the active stanza, is indicated by a red marker located to the left of the slider bar. The slider knob of the active stanza moves at constant speed, while the slider knobs of the other stanzas move according to their relative tempo variations with respect to the active stanza. The active stanza may be changed at any time simply by clicking on the respective playback symbol located to the left of each slider bar. The playback of the new active stanza then starts at the time position that musically corresponds to the last playback position of the former active stanza. This has the effect of seamlessly crossfading from one stanza to another while preserving the current playback position in a musical sense. One can also jump to any position within any of the stanzas by directly selecting a position on the respective slider. Such functionalities assist the user in detecting and analyzing the differences between several recorded stanzas of a single folk song. The Audio Switcher is realized as a plug-in of the SyncPlayer system [106; 57], an advanced software audio player with a plug-in interface for MIR applications that provides tools for navigating within audio recordings and browsing music collections. For further details and functionalities, we refer to the literature.
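A minimal sketch of how such a stanza switch can be computed from the individual alignments is given below. It assumes that each stanza's alignment to the MIDI reference is available as a list of (MIDI frame, audio frame) index pairs and uses a simple nearest-neighbor lookup; the function names are illustrative and not those of the SyncPlayer implementation.

```python
def map_time(t, path, frame_rate=10.0, invert=False):
    """Map a time position (in seconds) through an alignment path.

    path : list of (midi_frame, audio_frame) index pairs; if invert is True
           the lookup goes from audio time to MIDI time, otherwise from
           MIDI time to audio time.
    """
    src, dst = (1, 0) if invert else (0, 1)
    frame = t * frame_rate
    # Pick the path entry whose source frame is closest to the query frame.
    best = min(path, key=lambda p: abs(p[src] - frame))
    return best[dst] / frame_rate

def switch_stanza(t_active, path_active, path_new, frame_rate=10.0):
    """Musically corresponding position in the new stanza for a playback
    position t_active (seconds) in the currently active stanza."""
    # Map the active stanza's position onto the MIDI reference time axis,
    # then from the reference axis into the new stanza.
    t_midi = map_time(t_active, path_active, frame_rate, invert=True)
    return map_time(t_midi, path_new, frame_rate, invert=False)
```

Routing the mapping through the common MIDI reference is what turns the pairwise alignments into the multi-alignment exploited by the interface.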

7.4 Conclusion

In this chapter, we presented a multimodal approach for extracting performance parameters from folk song recordings by comparing the audio material with symbolically given reference transcriptions. As the main contribution, we introduced the concept of chroma templates that reveal the consistent and inconsistent melodic aspects across the various stanzas of a given recording. In computing these templates, we used tuning and time warping strategies to deal with local variations in melody, tuning, and tempo.
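As an illustration of the time-warping and averaging step behind such templates, the following sketch projects each stanza's chromagram onto the time axis of the MIDI reference via its alignment path and averages the results. Tuning compensation and other details of the actual procedure are omitted, and all names are hypothetical.

```python
import numpy as np

def chroma_template(chromagrams, paths, num_frames):
    """Average the time-warped chromagrams of all stanzas onto the common
    time axis of the MIDI reference.

    chromagrams : list of (12, N_k) arrays, one chromagram per stanza
    paths       : list of alignment paths, each a list of
                  (reference_frame, stanza_frame) index pairs
    num_frames  : number of feature frames of the MIDI reference
    """
    warped = []
    for chroma, path in zip(chromagrams, paths):
        stanza_on_ref = np.zeros((12, num_frames))
        counts = np.zeros(num_frames)
        for ref_frame, stanza_frame in path:
            stanza_on_ref[:, ref_frame] += chroma[:, stanza_frame]
            counts[ref_frame] += 1
        # Average reference frames that received several stanza frames.
        stanza_on_ref /= np.maximum(counts, 1)
        warped.append(stanza_on_ref)
    # The template is the frame-wise mean over all warped stanza chromagrams;
    # comparing individual warped stanzas against it highlights consistent
    # versus variable passages.
    return np.mean(warped, axis=0)
```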

Figure 7.9: Instance of the Audio Switcher plug-in of the SyncPlayer showing the synthesized version of the MIDI reference and the five different stanzas of the audio recording of OGL27517.

The variabilities across the stanzas of a given recording observed in this chapter may have many different causes, which need to be explored further in future research.

Often these causes are related to questions in the area of music cognition. A first hypothesis is that stable notes are structurally more important than variable notes. The stable notes may be the ones that form part of the singer's mental model of the song, whereas the variable ones are added to the model at performance time. Variations may also be caused by problems in remembering the song. It has been observed that melodies often stabilize after the singer has performed a few iterations. Such effects may offer insight into the workings of musical memory. Furthermore, melodic variabilities caused by ornamentations can also be interpreted as a creative aspect of performance. Such variations may be motivated by musical reasons, but also by the lyrics of a song. Sometimes song lines have an irregular length, necessitating the insertion or deletion of notes. Variations may also be introduced by the singer to emphasize key words in the text or, more generally, to express the meaning of the song. Finally, one may study details of tempo, timing, pitch, and loudness in relation to performance as a way of characterizing performance styles of individuals or regions.

Audio Retrieval

A Review of Content-Based Music Retrieval

The way music is stored, accessed, distributed, and consumed underwent a radical change in the last decades. Nowadays, large collections containing millions of digital music documents are accessible from anywhere around the world. Such a tremendous amount of readily available music requires retrieval strategies that allow users to explore large music collections in a convenient and enjoyable way. Most audio search engines rely on metadata and textual annotations of the actual audio content [19]. Editorial metadata typically include descriptions of the artist, title, or other release information. The drawback of a retrieval solely based on editorial metadata is that the user needs to have a relatively clear idea of what he or she is looking for. Typical query terms may be a title such as "Act naturally" when searching for the song by The Beatles, or a composer's name such as "Beethoven" (see Figure 8.1a).1 In other words, traditional editorial metadata only allow users to search for already known content. To overcome these limitations, editorial metadata has been more and more complemented by general and expressive annotations (so-called tags) of the actual musical content [10; 97; 173]. Typically, tags give descriptions of the musical style or genre of a recording, but may also include information about the mood, the musical key, or the tempo [110; 172]. In particular, tags form the basis for music recommendation and navigation systems that make the audio content accessible even when users are not looking for a specific song or artist but for music that exhibits certain musical properties [173]. The generation of such annotations of audio content, however, is typically a labor-intensive and time-consuming process [19; 172]. Furthermore, musical expert knowledge is often required for creating reliable, consistent, and musically meaningful annotations. To avoid this tedious process, recent attempts aim at substituting expert-generated tags by user-generated tags [172]. However, such tags tend to be less accurate, subjective, and rather noisy. In other words, they exhibit a high degree of variability between users. Crowd (or social) tagging, one popular strategy in this context, employs voting and filtering strategies based on large social networks of users for "cleaning" the tags [110]. Relying on the "wisdom of the crowd" rather than the "power of the few" [99], tags assigned by many users are considered more reliable than tags assigned by only a few users.

1 www.google.com (accessed Dec. 18, 2011)


Figure 8.1: Illustration of retrieval concepts. (a) Traditional retrieval using textual metadata (e. g., artist, title) and a web search engine. (b) Retrieval based on rich and expressive metadata given by tags. (c) Content-based retrieval using audio, MIDI, or score information.

Figure 8.1b shows the Last.fm2 tag cloud for "Beethoven". Here, the font size reflects the frequency of the individual tags. One major drawback of this approach is that it relies on a large crowd of users for creating reliable annotations [110]. While mainstream pop/rock music is typically covered by such annotations, less popular genres are often scarcely tagged. This phenomenon is also known as the "long-tail" problem [20; 172]. To overcome these problems, content-based retrieval strategies have great potential, as they do not rely on any manually created metadata but are exclusively based on the audio content and cover the entire audio material in an objective and reproducible way [19]. One possible approach is to employ automated procedures for tagging music, such as automatic genre recognition, mood recognition, or tempo estimation [9; 173]. The major drawback of these learning-based strategies is the requirement of large corpora of tagged music examples as training material and the limitation to queries in textual form. Furthermore, the quality of the tags generated by state-of-the-art procedures does not reach the quality of human-generated tags [173].

In this chapter, we present and discuss various retrieval strategies based on audio content that follow the query-by-example paradigm: given an audio recording or a fragment of it (used as query or example), the task is to automatically retrieve documents from a given music collection containing parts or aspects that are similar to it. As a result, retrieval systems following this paradigm do not require any textual descriptions. However, the notion of similarity used to compare different audio recordings (or fragments) is of crucial importance and largely depends on the respective application as well as the user requirements. Such strategies can be loosely classified according to their specificity, which refers to the degree of similarity between the query and the database documents.
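To make the paradigm concrete, the following sketch shows a deliberately simple query-by-example loop: it summarizes each recording by an averaged, normalized chroma vector and ranks database documents by correlation with the query. This crude similarity notion is a stand-in only; the approaches discussed below choose features and comparison methods according to the intended specificity, and all names here are hypothetical.

```python
import numpy as np

def retrieve_by_example(query_features, database, top_k=5):
    """Rank database documents by similarity to a query fragment.

    query_features : (12, M) chroma matrix of the query fragment
    database       : dict mapping document ids to (12, N) chroma matrices
    """
    def summarize(chroma):
        # Collapse the chromagram to a single normalized 12-dimensional vector.
        v = chroma.mean(axis=1)
        return v / (np.linalg.norm(v) + 1e-9)

    q = summarize(query_features)
    # Score every document by the inner product of the summary vectors.
    scores = {doc_id: float(np.dot(q, summarize(chroma)))
              for doc_id, chroma in database.items()}
    # Return the top-k document ids with their scores, best first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```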

The remainder of this chapter is organized as follows. In Section 8.1, we first give an overview of the various audio retrieval tasks following the query-by-example paradigm. In particular, we extend the concept of specificity by introducing a second aspect: the granularity of a retrieval task, referring to its temporal scope. Then, we discuss representative state-of-the-art approaches to audio identification (Section 8.2), audio matching (Section 8.3), and version identification (Section 8.4). In Section 8.5, we discuss open problems in the field of content-based retrieval and give an outlook on future directions.

2 www.last.fm (accessed Dec. 18, 2011)

[Figure 8.2 arranges content-based retrieval tasks by specificity and granularity, including audio identification, audio fingerprinting, plagiarism detection, copyright monitoring, audio matching, remix/remaster retrieval, cover song detection, year/epoch discovery, key/mode discovery, loudness-based retrieval, tag/metadata inference, mood classification, genre/style similarity, instrument-based retrieval, music/speech segmentation, and recommendation.]

Figure 8.2: Specificity/granularity pane showing the various facets of content-based music retrieval.