Predictive songwriting with concatenative accompaniment
Benedikte Wallace
Thesis submitted for the degree of Master in Programming and Networks
60 credits
Department of Informatics
Faculty of Mathematics and Natural Sciences
UNIVERSITY OF OSLO
© 2018 Benedikte Wallace
http://www.duo.uio.no/
Printed: Reprosentralen, University of Oslo
Abstract
Musicians often use tools such as loop-pedals and multitrack recorders to assist in improvisation and songwriting. While these devices are useful in creating new compositions from scratch, they do not contribute to the composition directly. In recent years, new musical instruments, interfaces and controllers using machine learning algorithms to create new sounds, generate accompaniment or construct novel compositions, have become available for both professional musicians and novices to enjoy.
This thesis describes the design, implementation and evaluation of a system for predictive songwriting and improvisation using concatenative accompaniment, which has been given the nickname PSCA. In its simplest form, the PSCA functions as an audio looper for vocal improvisation, but the system also utilises machine learning approaches to predict suitable harmonies to accompany the playback loop. Two machine learning algorithms were compared and implemented into the PSCA to facilitate harmony prediction: the hidden Markov model (HMM) and the Bidirectional Long Short-Term Memory (BLSTM). The HMM and BLSTM algorithms are trained on a dataset of lead sheets in order to learn the relationship between the notes in a melody and the chord which accompanies it, as well as the dependencies between chords needed to model chord progressions. In quantitative testing, the BLSTM model was found to learn the harmony prediction task more effectively than the HMM model; this was also supported by a qualitative analysis of musicians using the PSCA system.
The system proposed in this thesis provides a novel approach in which these two machine learning models are compared with regards to prediction accuracy on the dataset as well as the perceived musicality of each model when used for harmony prediction in the PSCA. This approach results in a system which can contribute to the improvisation and songwriting process by adding harmonies to the audio loop on-the-fly.
Contents
1 Introduction
1.1 Motivation
1.2 Goals
1.2.1 Songwriting
1.2.2 Accompaniment
1.2.3 Concatenation
1.2.4 Prediction
1.3 Machine learning approaches
1.4 Research Methods
1.4.1 Quantitative model evaluation
1.4.2 Qualitative model evaluation
1.4.3 User study
1.5 Outline
2 Background
2.1 Programming and music
2.2 NIMEs
2.3 Music Information Retrieval
2.4 Musical Artificial Intelligence
3 Sequence Classifiers
3.1 Harmony prediction
3.2 Hidden Markov Model
3.2.1 Markov Chains
3.2.2 Hidden Events
3.2.3 Decoding
3.2.4 Applying HMM to Harmony Prediction
3.3 Recurrent Neural Nets
3.3.1 Learning
3.3.2 LSTM: Long Short-Term Memory
3.3.3 Bidirectional RNN
3.3.4 Applying BLSTM for Harmony Prediction
4 Design and Implementation
4.1 The audio looper
4.2 Program structure
4.2.1 Pitch analysis
4.2.2 The note bank
4.2.3 Chord construction
4.2.4 Creating note vectors
4.2.5 Base case implementation
4.3 Machine learning approaches
4.3.1 Dataset
4.3.2 Pre-processing
4.3.3 Chord types
4.3.4 HMM model
4.3.5 BLSTM model
4.3.6 Discussion
4.4 Summary
5 Testing and evaluation
5.1 Model Validation
5.1.1 Motivations
5.1.2 Methods
5.1.3 Results
5.1.4 Discussion
5.2 Qualitative Model Analysis
5.2.1 Methods
5.2.2 Results
5.2.3 Discussion
5.2.4 Handling imbalanced datasets
5.3 Production Model Analysis
5.3.1 Heuristic BLSTM
5.3.2 Heuristic HMM
5.3.3 Results
5.4 User experience study
5.4.1 Motivations
5.4.2 Session overview
5.4.3 Data collection
5.4.4 Data analysis
5.4.5 Results
5.4.6 Discussion
6 Conclusion and future work
6.1 Overview of results
6.2 Future work
6.2.1 Improving the audio looper and concatenation process
6.2.2 Improving the PSCA controller
6.2.3 Harmony prediction
6.3 Final remarks
A Music Concepts
A.1 Music concepts
B Written feedback from UX study & participant form
B.1 Reflective feedback participant 1
B.2 Reflective feedback participant 2
B.3 User experience study participant form
List of Figures
1.1 The PSCA system setup: laptop running PSCA software, sound card, microphone, headset and Arduino foot switch controller. Photo: Benedikte Wallace
1.2 Common tools to aid in songwriting and improvisation
1.3 Melody with missing chord notation. The machine learning algorithms used in the PSCA are trained to predict a suitable chord for each bar.
1.4 Songwriter with computer assistance
1.5 PSCA system overview: Audio recorded by the user is added to the playback loop together with a chord selected by the harmony prediction models. The harmony is constructed using concatenated segments of the recorded audio.
2.1 First attempt at implementing an audio looper for the PSCA using PureData. Photo: Benedikte Wallace
2.2 Results reprinted from Nayebi and Vitelli’s paper GRUV: Algorithmic Music Generation using Recurrent Neural Networks [53]
2.3 Open NSynth Super physical interface for the NSynth algorithm. [43]
3.1 The harmony prediction task: Given a set of consecutive measures containing a monophonic melody, select a suitable harmony (q_n) for each measure.
3.2 A Markov model for transitions between four chords. Transition and entry probabilities are omitted.
3.3 HMM chord states and note emissions
3.4 RNN
3.5 RNN unrolled over time
3.6 LSTM cell unrolled over time
3.7 Figure showing the three gates of a single LSTM cell.
4.1 The PSCA system consists of software and hardware components. The Python scripts for running the PSCA system are available at https://github.com/benediktewallace/PSCA. Photos: Benedikte Wallace
4.2 System diagram of the hardware setup for using the PSCA system: Laptop, sound card, microphone, Arduino and foot switches
4.3 First prototype for controlling playback and recording with the PSCA. Photo: Benedikte Wallace
4.4 Program flow for the PSCA: User records audio which in turn is analysed, segmented and added to the note bank. The harmony prediction model predicts a chord, this chord is constructed by concatenating and adding audio segments from the note bank. Finally this concatenated audio layer is added to the playback loop together with the latest recording made by the user.
4.5 librosa piptrack function. Definition and example of use
4.6 Mappings of notes to integers from 0 - 11. Highlighted is a C major triad
4.7 Example of measure and resulting note vector
4.8 Note vectors created when singing one 4.8(a), two 4.8(b) and three notes 4.8(c)
4.9 Example of a lead sheet from the dataset: Let Me Call You Sweetheart (Friedman and Whitson, 1910)
4.10 Transition probabilities (A) and emission probabilities (B) for major-key and minor-key datasets as generated by the HMM.
4.11 Implementation of the bidirectional LSTM using Keras
4.12 The sample function for reweighting and sampling a probability distribution as presented in listing 8.6 in Deep learning with Python [16]
5.1 K-fold cross validation accuracy using 24 and 60-chord datasets. The BLSTM model had the highest median accuracy over all tests, it also had a much narrower interquartile range than the HMM models, suggesting that it is less sensitive to changes in the dataset.
5.2 Confusion Matrix for HMM and BLSTM using minor key dataset with 24 unique chords. These results show that the HMM misclassifies most samples as belonging to classes C, G or Am while the matrix pertaining to the BLSTM model shows a clear diagonal line where the model has predicted the correct class.
5.3 Confusion Matrix for HMM and BLSTM using major key dataset with 24 unique chords. These results show that the HMM misclassifies most samples as belonging to classes C, G or F while the matrix pertaining to the BLSTM model shows a clear diagonal line where the model has predicted the correct class.
5.4 Confusion Matrix for HMM and BLSTM using mixed dataset with 24 unique chords. The predictions generated by the HMM using the mixed dataset are skewed towards C, F and G. The results are similar to those generated by the HMM using only songs in major keys. As with the major and minor song datasets, the BLSTM achieves better prediction accuracy than the HMM on the mixed dataset.
5.5 Confusion Matrix for HMM and BLSTM using minor key dataset with 60 unique chords. These figures show that the HMM predictions are skewed towards C, some misclassifications also fall into classes Am and Asus. Predictions generated by the BLSTM mostly belong to the major and minor chord types. Other chord types such as suspended, augmented and so on, are rarely generated by either model
5.6 Confusion Matrix for HMM and BLSTM using major key dataset with 60 unique chords. These figures show that when using major key songs only, predictions from both models are skewed towards C, G and F.
5.7 Confusion Matrix for HMM and BLSTM using mixed dataset with 60 unique chords. As with the 24-chord mixed dataset, the HMM predictions are similar to those generated with the major key dataset. The BLSTM model generates more varied predictions, and seems to model the underlying mixed dataset more precisely.
5.8 Prediction distribution for BLSTM and HMM trained on 24-chord mixed dataset. The figures show that the HMM predicts class C more than 80% of the time. Predictions generated by the BLSTM model show a distribution similar to the true chord distribution of the 24-chord mixed dataset (shown in figure 5.9(a)).
5.9 Distribution of the chords found in the 24-chord and 60-chord mixed datasets
5.10 Distribution of chords generated by heuristic BLSTM
5.11 Distribution of chords generated by heuristic HMM
5.12 Examples of different chord sequences as predicted by the models trained for use in the PSCA, as well as the original chord sequence. Both the HMM and BLSTM predict suitable chords for each bar. Although the naive, base case approach results in acceptable choices for the first and third measure, the second and fourth chords create dissonance. This shows how the naive approach generates unpredictable chord progressions.
5.13 Performers interacting with the PSCA system, a video is available on YouTube: https://www.youtube.com/watch?v=t-owEHRYmi4 and Zenodo: https://zenodo.org/record/1214505#.WumcwMixXVM
6.1 Evolution of the PSCA system
List of Tables
4.1 Examples of original chords found in the lead sheets and the resulting triad label used when simplifying to closest triad
4.2 Example of predictions sampled at different temperatures
5.1 Lead sheet groupings and resulting datasets. The models are trained on each of the six datasets in order to compare the 60-chord vs. 24-chord approach as well as the consequences of segmenting the dataset into separate sets of major and minor key songs
5.2 24-chord models: Average accuracy for k-fold cross validation with k = 10.
5.3 60-chord models: Average accuracy for k-fold cross validation with k = 10.
5.4 User experience study: Session structure
5.5 Concept codes for feedback analysis
5.6 Feedback and labels from UX study
5.7 User experience study: Session details
Abbreviations
BLSTM Bidirectional Long Short-Term Memory
CNN Convolutional Neural Network
HMM hidden Markov model
LSTM Long Short-Term Memory
MIR Music Information Retrieval
ML machine learning
NIME New Interfaces for Musical Expression
RNN Recurrent Neural Network
TA Thematic Analysis
UX User Experience
Acknowledgement
I would like to extend my deepest gratitude to my supervisor, postdoctoral fellow Charles Martin for his continuous support, inspiration and guidance throughout the thesis work. I would also like to thank my family, especially my talented sister, jazz singer and composer Karoline Wallace, for sharing her musical expertise and my significant other Magnus B. Skarpheðinsson for his endless support.
Chapter 1
Introduction
Figure 1.1: The PSCA system setup: laptop running PSCA software, sound card, microphone, headset and Arduino foot switch controller. Photo: Benedikte Wallace
1.1 Motivation
Examining the intersection between science and art has always fascinated me and driven my studies. Studying machine learning has allowed me to delve deeper and participate in exciting research in this field as we see machine learning being used for many different tasks in music, from building new musical instruments and creating sound emulators, to music generation. In this thesis I propose a novel system which applies machine learning methods to predict suitable harmonies to accompany a real-time audio looper for vocal improvisation. The system has been nicknamed PSCA, referring to the acronym of the thesis title.
During the songwriting process, musicians often use tools such as multitrack recorders and loop-pedals to assist in improvisation and trying out new musical ideas. Today’s multitrack recorders are often referred to as digital audio workstations (DAWs). DAWs such as the software shown in figure 1.2(b) allow the user to record and edit several recordings to construct a piece of music. Although offering a great deal of control and a range of different audio processing tools, creating music this way requires knowledge of the DAW software and can be a difficult task for novices.
Using a loop-pedal enables the musician to layer and loop recordings on- the-fly. By adding layers, a single performer can create complex harmonies, rhythms and dynamics. Since the loop-pedal provides the musician with a way to record new layers without stopping the playback of the previous recordings, loopers such as the Boss RC-30 presented in figure 1.2(a), are often used in live performance.
The challenge for many performers and songwriters when working with such tools is that they have to create every piece of sound that is used, either by recording or synthesising them directly or by relying on libraries containing synthesised instruments. This can make the songwriting process uninspiring as well as time consuming. If instead, accompaniment could be generated using the performers’ own creative output, this may enhance the songwriting process. Furthermore, if additional accompaniment could be produced by use of machine learning methods this would create novel contributions to the composition, inspired, but not directly generated, by the performer.
1.2 Goals
I developed the PSCA system to address these problems in the following ways: the PSCA is, in its simplest form, an audio looper, but the PSCA goes beyond the playback of live recordings by generating accompaniment in the form of harmonies that suit the recorded melody. This adds a new dimension to the standard audio looping functions. Also, these harmonies are constructed using the recordings of the musician’s own voice instead of using pre-recorded or synthesised instruments. This provides the user with automatic accompaniment with an interesting sound. In the following subsections I have outlined how the system relates to each of the concepts contained in the acronym, PSCA.
1.2.1 Songwriting
As mentioned above, my work is based on creating an audio looper for vocals. The user sings into a microphone and the system loops the recording, adding new layers to the playback loop as the user records them.
These audio looping functions are implemented in Python and controlled using a pair of foot switches and an Arduino. The goal of the PSCA system is to provide a similarly simple interface to looping audio as in the "Loop Station" pedal, but to harness some of the power of computer-based audio as in a DAW. Using the PSCA system a musician can improvise or test out new ideas to assist in the songwriting process.
Figure 1.2: Common tools to aid in songwriting and improvisation. (a) Boss RC-30 Loop Station. Photo: Copyright © Roland Corporation. (b) Ableton Live DAW for multitrack recording. Photo: Benedikte Wallace
1.2.2 Accompaniment
In addition to a standard audio looping function, the PSCA system generates chords to accompany the audio recording. This creates an underlying harmony that changes as the user records new layers to the loop. Several systems exist today to aid in songwriting and improvisation by generating additional accompaniment to the user input, for example, Microsoft SongSmith [71]. The user of the SongSmith system records a vocal melody by following a metronome. After this recording has been analysed the system plays it back with added harmonies. The user can also set certain variables such as the preferred genre, the amount of happy and sad chords (reflecting the major or minor feel of the accompaniment), and a jazz-factor which controls the complexity and predictability of the generated harmonies. Similarly, the Reflexive Looper [57] functions in much the same way as traditional loop-pedals, but is also able to provide accompaniment to a predefined song structure by arranging the musician’s recordings. Other systems such as ChordRipple [38] provide musicians with a user interface which recommends chord choices and allows the user to experiment with different chord progressions. The PSCA aims to predict harmonies as in the SongSmith system, but use the performer’s own audio as the basis for the accompaniment, as in Reflexive Looper.
1.2.3 Concatenation
A common trait among systems that generate musical accompaniment is the use of prerecorded or synthesised instruments to generate the accompaniment. Instead of using predefined instrument libraries the PSCA system concatenates smaller slices of the recorded audio to create the desired triad. This process is similar to that of granular synthesis.
In granular synthesis, pieces of audio are sampled and split into small segments (less than 1 second) referred to as grains. Curtis Roads, one of the prominent names in granular synthesis, has written on the topic in his book Microsound [65]. The grains can be processed to change parameters such as pitch and volume, be layered on top of each other or played as a sequence. The PSCA collects 1/8th-note segments of the recorded audio in a “note bank”. When a chord is constructed, the audio segments containing the appropriate notes are chosen from the note bank and concatenated to create an underlying harmony for the duration of the audio loop.
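The chord construction step described above can be illustrated with a minimal sketch. This is not the PSCA implementation; it assumes, for illustration only, that the note bank is a dictionary keyed by pitch class (0 to 11) holding lists of mono NumPy audio segments, and the names `build_chord_layer` and `sine` are hypothetical.

```python
import numpy as np

def build_chord_layer(note_bank, chord_pitch_classes, loop_length):
    """Mix one concatenated track per chord tone into a layer of loop_length samples.

    note_bank is assumed to map a pitch class (0-11) to a list of 1-D float
    arrays, each holding one short recorded segment of that note.
    """
    layer = np.zeros(loop_length, dtype=np.float32)
    voices = 0
    for pitch_class in chord_pitch_classes:
        segments = note_bank.get(pitch_class, [])
        if not segments:
            continue  # this chord tone has not been recorded yet
        track = np.concatenate(segments).astype(np.float32)
        repeats = -(-loop_length // len(track))  # ceiling division
        layer += np.tile(track, repeats)[:loop_length]
        voices += 1
    if voices:
        layer /= voices  # average the voices to avoid clipping
    return layer

def sine(freq, n_samples, sample_rate=8000):
    """Synthetic stand-in for a recorded vocal segment."""
    t = np.arange(n_samples) / sample_rate
    return np.sin(2 * np.pi * freq * t).astype(np.float32)

# Toy note bank containing the tones of a C major triad (C=0, E=4, G=7).
note_bank = {
    0: [sine(261.63, 1000)],
    4: [sine(329.63, 1000)],
    7: [sine(392.00, 1000), sine(392.00, 1000)],
}
layer = build_chord_layer(note_bank, (0, 4, 7), loop_length=16000)
```

Averaging the voices is one simple way to keep the mixed layer within the range of the individual segments; other gain strategies would work equally well.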
Figure 1.3: Melody with missing chord notation. The machine learning algorithms used in the PSCA are trained to predict a suitable chord for each bar.
1.2.4 Prediction
Though theoretically, there are no “wrong” chord choices, most people would agree that, given a melody, some chord choices create a dissonance that does not sound particularly good [9]. In order to decide precisely what chords should accompany a melody, a model needs to be able to learn relationships between the notes in a melody and the chords that are chosen to accompany them. Also, in order to model chord progression, the model needs to be able to learn temporal dependencies between chords.
This type of problem is often referred to as a sequence classification problem; the model is given a sequence of tokens and its goal is to predict the next token. In this work, this problem is referred to as "harmony prediction". An example of this problem is shown in Figure 1.3 where a melody is given with blank spaces for the harmony. A human would solve this problem by choosing a sequence of chords that works for the melody, as well as forming a sensible harmonic sequence. The goal of PSCA is to implement chord prediction in an interactive system. Further, we aim to use the system to compare two machine learning approaches to this problem.
1.3 Machine learning approaches
Figure 1.4: Songwriter with computer assistance. (a) Songwriter with audio software. (b) Songwriter with smart audio software such as the PSCA
Two different machine learning (ML) approaches for harmony prediction are compared in this work and implemented into the PSCA system. The first is an HMM, one of the most prominent methods used in similar research. The strategy proposed by Simon et al. in their paper on Microsoft SongSmith [71] is used as a basis for the implementation of the HMM. Simon et al. train an HMM on a dataset of lead sheets in order to generate chords for a recorded vocal melody. In addition, a deep learning approach to this problem was implemented using a Recurrent Neural Network (RNN), more specifically a Long Short-Term Memory (LSTM) network [36]. For the LSTM, the approach proposed by Lim et al. [46] in their paper “Chord Generation from Symbolic Melody using BLSTM Networks” was used. Both papers, as well as my own implementations, use models that are trained on symbolic melody input and target chord labels from a dataset of lead sheets. When using the PSCA, the audio recorded by the performer is transformed into a note vector and fed to the trained models, resulting in a new chord prediction.
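To make the idea of a note vector concrete, the following is a minimal sketch under stated assumptions: notes are given as MIDI pitch numbers, mapped to pitch classes 0 to 11 (C = 0), and the vector simply counts how often each pitch class occurs in a measure. Whether the PSCA uses counts, durations or binary indicators is not specified here, and the function name `note_vector` is hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_vector(midi_pitches):
    """Return a 12-element list counting occurrences of each pitch class.

    Assumption for illustration: octave information is discarded, so
    C3 and C4 both increment index 0.
    """
    vector = [0] * 12
    for pitch in midi_pitches:
        vector[pitch % 12] += 1
    return vector

# One measure containing the notes C4, E4, G4, E4.
measure = [60, 64, 67, 64]
vec = note_vector(measure)

# Human-readable view of the non-zero entries.
present = {NOTE_NAMES[i]: count for i, count in enumerate(vec) if count}
```

A fixed-length vector of this kind gives both models the same input size for every measure, regardless of how many notes were sung.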
The research of Lim et al. does not include an interactive system; there is therefore no way to utilise the models they have created in a music system. Also, in contrast to SongSmith, which uses synthesised instruments and must process the input before adding harmonies, the PSCA includes concatenative accompaniment which is generated on-the-fly. My work presents a novel approach in which the HMM and BLSTM are compared with regard to prediction accuracy on the lead sheet dataset as well as the perceived musicality of each model when used for chord generation in the PSCA. Figure 1.5 shows an overview of the PSCA system.
Figure 1.5: PSCA system overview: Audio recorded by the user is added to the playback loop together with a chord selected by the harmony prediction models. The harmony is constructed using concatenated segments of the recorded audio.
1.4 Research Methods
Multiple research methods are used in this thesis to evaluate different aspects of the PSCA system. Three studies are presented in this work.
A quantitative analysis is performed in order to assess how much each model has learned, and a qualitative analysis of each model is carried out to determine how good their proposed solutions are. Finally, a user study is performed which addresses the performers’ experience using the system.
In this final study the PSCA overall is evaluated, not just the individual models. As a base case for evaluating the machine learning approaches a naive approach to harmony prediction is implemented as well. This naive approach chooses the harmony based only on the pitch of the first note in each measure.
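A naive rule of this kind might look like the sketch below. It is only an illustration: it assumes notes are MIDI pitch numbers and that the first note is treated as the root of a major triad, which is a choice made here for the example and not necessarily the rule used in the PSCA base case.

```python
def naive_chord(measure_pitches):
    """Pick a chord from the first note of the measure only.

    Assumption for illustration: the first note's pitch class is used as
    the root of a major triad (root, major third, perfect fifth).
    """
    if not measure_pitches:
        raise ValueError("measure must contain at least one note")
    root = measure_pitches[0] % 12
    return [root, (root + 4) % 12, (root + 7) % 12]

# A measure starting on G4 yields the pitch classes of a G major triad.
chord = naive_chord([67, 60, 64])
```

Because the rule ignores every note after the first, as well as the preceding chords, it provides a useful lower bound against which the trained models can be compared.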
1.4.1 Quantitative model evaluation
Cross validation is used in order to evaluate how much the models have learned. The dataset is split into two disjoint sets, one set for training and one set for testing. The songs included in the training data are used to train the models, while the test set is withheld until testing. When the models are done training, the test data is used to evaluate the accuracy of the models on unseen data by comparing the predicted chords for each measure with the true chords from the test set. The training set contains approximately 80% of the data while the remaining 20% is used for testing.
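The hold-out procedure described above can be sketched as follows. The function names and the toy data are hypothetical; the sketch assumes the split is made per song, so that measures from one song never appear in both sets, and that accuracy is the fraction of measures whose predicted chord matches the true chord.

```python
import random

def split_songs(songs, test_fraction=0.2, seed=0):
    """Shuffle the songs and return (train, test) as disjoint lists."""
    shuffled = list(songs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def chord_accuracy(predicted, true):
    """Fraction of measures where the predicted chord equals the true chord."""
    if len(predicted) != len(true):
        raise ValueError("predicted and true must have the same length")
    correct = sum(p == t for p, t in zip(predicted, true))
    return correct / len(true)

# Ten placeholder song identifiers split roughly 80/20.
train, test = split_songs(range(10))

# Toy predictions for four measures of an unseen song.
accuracy = chord_accuracy(["C", "G", "Am", "F"], ["C", "G", "F", "F"])
```

Fixing the random seed makes the split reproducible, so that both models can be evaluated on exactly the same unseen songs.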
1.4.2 Qualitative model evaluation
Examining the prediction accuracy of the models alone is not sufficient to understand the quality of the predictions they make. Therefore, qualitative evaluations of the models’ predictions are needed. This is done using confusion matrices as well as prediction histograms, allowing for a visual representation of which chords the models are suggesting. The evaluation procedure is oriented towards identifying how well the models generate chord progressions that are interesting and aesthetically pleasing as well as theoretically correct.
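As a small illustration of these two tools, the sketch below builds a confusion matrix and a prediction histogram from toy data. The labels and predictions are invented for the example and are not results from this thesis; the function name `confusion_matrix` is defined here rather than taken from a library.

```python
from collections import Counter

def confusion_matrix(true, predicted, labels):
    """Return a matrix where entry [i][j] counts samples of true class
    labels[i] that were predicted as class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["C", "F", "G", "Am"]
true = ["C", "F", "G", "Am", "C", "G"]
pred = ["C", "C", "G", "C", "C", "G"]  # invented output skewed towards C

matrix = confusion_matrix(true, pred, labels)
histogram = Counter(pred)  # how often each chord was predicted
```

A model that predicts well shows most of its counts on the diagonal of the matrix, whereas a model that collapses onto a few frequent chords shows filled columns and a histogram dominated by those chords.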
1.4.3 User study
The human evaluation of the PSCA system is an equally important aspect to consider. Since the underlying mechanic of the system is an audio looper, this means the accompanying chord is not the only element affecting the overall harmony. The previously recorded layers loop in the background as well, creating more complex and slow-changing harmonies than those that would be created if the recordings were not looped and layered.
Consequently, the accuracies of the ML models on the dataset are not directly transferable to the users’ experience of the different models when used in the PSCA system. The performer’s experience when working with the PSCA is used to evaluate the enjoyability and creative potential of the different modes of the PSCA system (base-case mode, BLSTM mode or HMM mode). A User Experience (UX) study with two participants was carried out. The feedback collected from these participants has been analysed using a Thematic Analysis (TA) approach [12], which aims to identify recurring themes found in the user comments.
1.5 Outline
This thesis consists of six chapters: introduction, background, sequence classifiers, design and implementation, testing and evaluation, and conclusions and future work. Chapter 2 surveys past and related work and chapter 3 presents descriptions of the ML models used. Chapter 4 contains an overview of the design and implementation process, presents the program structure of the PSCA system and describes details on training the ML models for the harmony prediction task.
Chapter 5 outlines the experiments performed in order to evaluate the ML models and presents the results. This chapter also includes a UX study of the PSCA system. The final chapter contains a general discussion and conclusions of this thesis as well as suggestions for future work.
Chapter 2
Background
Last year, 2017, marked the 60th anniversary of the first time a piece of music was played by a computer. In 1957 Max Mathews, often referred to as the father of computer music synthesis, composed a short piece of music, which was then played by an IBM 704. In the more than 60 years that followed, the field has grown, and the development of new digital music instruments, music programming languages and instrument controllers has continued to interest developers, researchers and musicians alike.
This chapter attempts to give an overview of the different fields relevant to this thesis. First a history of programming in music is presented, followed by an overview and recent works from the New Interfaces for Musical Expression (NIME) and Music Information Retrieval (MIR) communities, and finally a closer look at machine learning in music.
2.1 Programming and music
When developing an application for music and sound creation, there are several available programming paradigms to consider. Which approach is the best fit depends on the application being made. The following section presents some of the music programming paradigms that are developed specifically to assist musicians in creating music.
The evolution of the modern computer music systems and dedicated programming languages for music generation can be said to have its beginning in the early programming environments of MUSIC-N, developed by Max Mathews in the late 1950s. The MUSIC-N systems introduced key concepts [82] that still influence the development of new computer music systems. The sound and music computing system Csound [45] which is widely used today is considered its direct descendant. This is also true of the Max/MSP music programming environment.
Named after Max Mathews, Max was developed by Miller S. Puckette in the mid 1980s [61]. Max/MSP is a modular music and sound developing platform allowing for the assembly of modules, or patches, for custom sound creation and manipulation. Max/MSP works with any type of physical controller that outputs MIDI or OSC (Open Sound Control) [30].
The platform can also output these messages for controlling external, physical devices. Max/MSP remains a particularly popular tool for a wide variety of tasks ranging from visual and sonic art to construction of new music interfaces and controllers. Puckette also developed the open source environment Pure Data, shortened to Pd, in the late 1990s [62]. Pd offers a high-level programming language that allows the user to build larger modules out of smaller, basic functions as well as offering a range of convenient modules for quick assembly and testing of new ideas. The user connects modules by dragging “wires” between the in- and outputs of each module using the Pd GUI. As we enter the 2000s and the era of high performance personal computers we see an increase in on-the-fly music programming systems for use in live performance. An early example of this is ChucK, developed in 2003 by Wang and Cook [83].
These systems have in common that they are developed in order to assist musicians in music composition, and so, low-level features are often hidden from the user. This can be practical when the system is used directly for creating music, or when quick prototyping is needed, but often does not allow for control over details such as how audio is stored and parsed, and how communication between the system and other applications works.
One approach allowing for more control is to communicate to one of these languages from a more traditional programming environment (e.g., C++ or Java).
SuperCollider [84], developed initially in the 1990s, uses OSC to communicate between a server and client and uses its own Smalltalk-like language [49]. Today several music programming languages communicate with a SuperCollider synthesis server, which provides an ability to define arbitrary synthesisers and to trigger and manipulate them in real time. One such example is Sonic Pi, which was developed to help children learn programming by creating music [2]. The Sonic Pi language, developed by Aaron et al. at Cambridge University, is implemented using a Ruby DSL that facilitates communication with a SuperCollider server. Aaron and Blackwell also presented Overtone in 2013 [1], an elegant functional language which also communicates with a SuperCollider server. Similarly, it is possible to communicate with the Pd environment using libPd [13], which turns Pure Data into an embeddable audio library, enabling developers to use Pd as the sound engine for their applications.
An alternative approach is to use general-purpose programming languages such as C++ or Python together with external libraries designed specifically for sound and music creation. Today, libraries for audio analysis and manipulation exist for almost every programming language.
Figure 2.1: First attempt at implementing an audio looper for the PSCA using PureData. Photo: Benedikte Wallace
In the PSCA, a Python module called sounddevice [31] is used to access and control recording and playback. This module provides bindings to PortAudio, a cross-platform audio input/output library written in C.
Using sounddevice the audio is stored in and read from NumPy arrays, a practical format as audio can then be manipulated in an efficient way with a multitude of convenient functions to add, slice and store their content.
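As a rough illustration of why NumPy arrays are convenient here, the following minimal sketch records a take with sounddevice and mixes it onto an existing loop. It is not the PSCA implementation: the sample rate, the function names and the clipping strategy are assumptions made for this example, and sounddevice is imported lazily so that the array manipulation can be tried without audio hardware.

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed sample rate for this sketch


def record_take(seconds, channels=1):
    """Record `seconds` of audio into a NumPy array of shape (frames, channels)."""
    import sounddevice as sd  # imported lazily; requires PortAudio and an input device

    take = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=channels)
    sd.wait()  # block until the recording has finished
    return take


def overdub(loop, take):
    """Mix a new take onto an existing loop, trimming both to the shorter length."""
    length = min(len(loop), len(take))
    mixed = loop[:length] + take[:length]   # element-wise addition of the two signals
    return np.clip(mixed, -1.0, 1.0)        # keep samples inside the valid float range


def play_loop(loop):
    """Start repeated playback of the loop (non-blocking)."""
    import sounddevice as sd

    sd.play(loop, samplerate=SAMPLE_RATE, loop=True)
```

Because the audio is an ordinary array, slicing out a segment for analysis is simply `loop[start:stop]`, and layering a new recording is a single addition.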
Often graphical representation and high-level abstraction can come at the cost of low-level controls, and choosing the right tool for the job is as key for music applications as for any other system. For the development of the PSCA it was crucial to be able to access and manipulate the audio directly in order to store and analyse audio segments, as well as reformatting them to generate predictions from the models. Hence, this approach was the most sensible for this work, using a programming language I was already familiar with to build an application which can deal with both machine learning aspects of the program and the audio manipulation. In the initial stages of this work a looping system was prototyped in Pd. An image of this Pd patch can be found in figure 2.1. However, in order to integrate the looping functions with the machine learning models, it was easier to transfer development to Python. Although this Pd approach was not used in the development of the PSCA in the end, the graphical representations of the Pd environment allowed for easy construction and better understanding of the functions needed to build the audio looper when I later began developing the Python script using the sounddevice library.
2.2 NIMEs
During the last 20 years we have also seen a rapid evolution in musical controllers due in part to the availability of inexpensive hardware and
new sensors that can capture features such as force and position, as discussed by Cook in 2001 [19]. That same year the first New Interfaces for Musical Expression (NIME) workshop [60] was held in Seattle, Washington.
It subsequently became a regular conference for musicians, composers, researchers and developers dedicated to research on the development of new technologies and their role in musical expression and performance.
The initial NIME workshop had as its aim to initiate an exchange of ideas in order to, amongst other things, “survey and discuss the current state of control interfaces for musical performance, identify current and promising directions of research and unsolved problems” [60, p. 1]. Since this workshop, the resulting conference has showcased research from a wide variety of contexts, not only pertaining to music controllers but also cross-disciplinary research to answer fundamental questions like “What is a musical instrument?” and “What is a composition?”.
In 2017 Jensenius and Lyons published A NIME Reader [40], as the need for a collection of articles that could broadly represent the conference series became apparent. In the preface of this book the editors describe the acronym NIME as having several potential meanings and indeed this field encompasses many themes, from reflective studies on the field itself, to development of new music systems [68], art installations [56], and naturally, the creation of new interfaces and controllers for musical expression. The Wekinator, presented at NIME in 2009, is an example of a NIME that can be used to create NIMEs. The Wekinator [28] is a meta-instrument that allows its user to interactively modify and train many different machine learning (ML) algorithms in real-time. The software is created to allow musicians, composers and new instrument designers to train an algorithm to react in certain ways to certain input. The input sources can vary from custom sensors to gestures and audio. Using the Wekinator GUI the user records examples of input and output mappings and trains the algorithm to recognise these inputs in the future and thus map them to the correct output.
The Wekinator is just one example of a NIME that uses machine learning tools. Another theme at NIME which ties in with my work is the generation of accompaniment for musical improvisation. Kitani and Koike [42] presented an online system which generates accompaniment for live percussion input, simulating the improvisation between two percussionists. Similarly, Xia and Dannenberg [86] present an “artificial pianist” that can learn to interact with a human pianist using rehearsal experience. In the paper by Pachet et al. [57] titled Reflexive Loopers for Solo Musical Improvisation they propose a new approach to loop pedals that addresses the issue that loop pedals only play back the same audio, making performances potentially monotonous. Solo improvisation can be difficult, and the goal for the reflexive looper is to create accompaniment that fits the given input. Reflexive loopers (RLs) differ from standard
loop pedals in two major ways: firstly, the RL determines the playback material according to a predetermined chord sequence as well as the style of the musician’s playing so that the generated accompaniment follows the musician’s performance. Secondly, the RL can distinguish between several playing modes. As a result, if the musician is playing bass, for example, the RL will follow the “other members” principle and play differently than it would if the input had been a harmony. The NIME conferences have also showcased works relating to vocal performance. The VOICON [58] is an augmented microphone which allows mappings between the singer’s gestures when holding the microphone and vocal modifications. Circular motion of the microphone generates vibrato, while tilting causes a change in pitch. The user can also assign a vocal effect and control the amount of the effect by applying pressure.
The PSCA is in its own way a NIME, although New Instrument for Musical Experimentation would possibly be a more fitting acronym. While the PSCA so far does little in the way of presenting an interface or gesture controls to the user, it does combine some of the other themes seen at NIME, specifically using machine learning in improvisation and accompaniment and new systems for vocal performance.
2.3 Music Information Retrieval
Another important research area that has facilitated the development of systems such as the one presented in this thesis is MIR. Research in this field focuses on technologies to aid in information mining in music. MIR research is often in the intersection of digital signal processing, information retrieval and musicology. As noted by Downie in his chapter on Music Information Retrieval in the Annual Review of Information Science and Technology in 2003: “The difficulties that arise from the complex interaction of the different music information facets can be labeled the ‘multifaceted challenge.’” [23, p. 297]
Downie presents seven facets of MIR: pitch, temporal, harmonic, timbral, editorial, textual, and bibliographic facets. Each facet has its own representations and brings its own challenges. The pitch facet is concerned with analysing pitch. Pitch is represented in several different ways, e.g., graphical notation (such as ♩), note names (e.g., C, C#, D) or pitch class numbers (0, 1, 2, . . . , 11). Given that pitch can be represented in so many ways, techniques used for analysing pitch vary greatly. The temporal facet includes any information pertaining to the duration of musical events.
This may refer to the duration of pitches, rests or harmonies, metronome indication (i.e., beats per minute) or other, relative, tempo descriptions such as rubato or accelerando. In many music styles it is expected that musicians may stray from the strict tempo notation of written music; retrieving temporal information can therefore be a quite complex task.
Information concerning simultaneous pitches falls into the harmonic facet. Harmonic events can be denoted using chord names, as in the lead sheets used for this thesis, but this is not always the case. As long as two or more pitches sound at the same time, this is considered a harmony, and often such events are not denoted with chord names. Retrieving harmony information in such cases may require analysis of each of the pitches contained in the harmony.
The timbral facet is composed of information pertaining to the colour of a note. Timbre is what enables a listener to distinguish between a note played on a piano and the same note being played on a clarinet. Thereby, the timbre of an instrument can be used to identify it amongst other instruments in an audio recording. This requires relatively advanced signal processing capabilities in order to separate and examine the amplitude and duration of the frequencies that make up the audio signal. The editorial facet is comprised of performance instructions, or lack thereof.
Performance instructions include dynamic instructions, articulations and ornamentation and can be given in a textual (e.g., crescendo, decrescendo) or an iconic (e.g., <, >) format, and sometimes both. Different musicians may use differing editorial information when transcribing the same piece of music; this causes problems when attempting to decide on a definitive version of a work for inclusion in a MIR system. The final two facets are the textual and bibliographic facets. The textual facet is concerned with song lyrics, while the bibliographic facet pertains to information such as the work’s title, names of composers and publishers, publication dates and so on. In other words, information pertaining to the bibliographic facet is not derived from the content of the composition; rather, it is music metadata.
In 2008 Lartillot et al. presented the MIRToolbox [44], a set of functions written in Matlab for extracting features from audio. The MIRToolbox contains the necessary functions to extract information on for example rhythm, tonality, timbre and form of the audio. A problem shared by those in the MIR community is that of multiple representations of music.
Collections of music can contain several different types of audio recording (CD, LP, MP3 and so on) as well as multiple symbolic representations (printed notes, MIDI and many others). Therefore, MIR encompasses many different techniques and technologies in order to handle the various representations and tasks.
Another significant problem in MIR research is the ability for researchers to access large collections of music. The Sonny Bono Copyright Term Extension Act and other similar laws have led to a considerable decrease in the number of sound recordings in the public domain. This also means that researchers are unable to build a shared, persistent collection of music written after 1922 [24]. This is an issue I experienced myself during my work, as the lead sheet datasets used by other researchers cannot be shared publicly. Wikifonia.org obtained a temporary licence to share a large set
of lead sheets from various genres, but unfortunately their licence expired in 2013, and the dataset is no longer available through their site. Instead researchers in need of lead sheets often use private collections and give little specific information regarding their contents.
Numerous MIR tools have been used in the development of the PSCA, most prominently the music21 package [20] and the librosa module [50].
As well as using several of the convenience functions offered by librosa for transforming pitches to notes and so on, the function piptrack is used for pitch analysis. The piptrack function tracks pitch by applying parabolic interpolation to the peaks of a short-time Fourier transform (STFT) [73]. The Fourier transform is a core component of sound processing and synthesis. Using the Fourier transform we can decompose a sound into its elementary sine waves, enabling both analysis and transformation of the original audio signal.
These principles are also used for creation of new sounds from computer generated sine waves. Music21 is a Python toolkit for analysing and generating music scores in the MusicXML format. It allows for easy extraction of chord and note information as well as information pertaining to the score itself such as key, time signature and tempo.
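To make the idea of Fourier-based pitch estimation more concrete, the sketch below computes the magnitude spectrum of a single audio frame with a naive discrete Fourier transform, refines the strongest peak with parabolic interpolation and maps the resulting frequency to a pitch class. This is a simplified stand-in written for illustration only, not the librosa implementation: the function names, the Hann window, the frame length and the sample rate are all assumptions.

```python
import cmath
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes of a Hann-windowed frame (first half of the spectrum)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]


def estimate_pitch(frame, sample_rate):
    """Estimate the dominant frequency in Hz of one frame."""
    mags = magnitude_spectrum(frame)
    # Strongest bin, excluding the edges so both neighbours exist.
    k = max(range(1, len(mags) - 1), key=mags.__getitem__)
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    denom = a - 2 * b + c
    # Vertex of the parabola through the three points, as a fraction of a bin.
    shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
    return (k + shift) * sample_rate / len(frame)


def hz_to_pitch_class(frequency):
    """Map a frequency to the nearest equal-tempered pitch class (A4 = 440 Hz)."""
    midi = round(69 + 12 * math.log2(frequency / 440.0))
    return PITCH_CLASSES[midi % 12]


# A synthetic 440 Hz sine wave stands in for a recorded frame.
RATE = 8000
FRAME = [math.sin(2 * math.pi * 440.0 * t / RATE) for t in range(512)]
ESTIMATE = estimate_pitch(FRAME, RATE)
```

The interpolation step matters because the bins of a 512-sample frame at 8 kHz are more than 15 Hz apart, which is coarser than the distance between neighbouring notes in the lower registers.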
2.4 Musical Artificial Intelligence
Examples of artificial intelligence, or AI, in music can be divided into the following categories: compositional AI, improvisational AI and AI for musical performance [21]. The history of musical AI has its roots in algorithmic music composition. Therefore, the examples of compositional AI have the longest history. Though often considered to be pioneered by Lejaren Hiller’s development of a language for computer-assisted music composition named MUSICOMP, the history of algorithmic composition can be said to go back a good deal further [66]. The pianola, for example, was constructed in the 1800s and plays songs automatically by reading a piano roll, similar to a punch card. Other pioneers of this field, such as Iannis Xenakis, also used composition algorithms without aid of computers. Xenakis criticised [85] the results of serialism’s strict predetermination and suggested a solution using statistical methods such as Markov chains and probabilistic logic. This approach of using Markov models, or the hidden Markov model (HMM), to generate new music is still used today. Allan and Williams used this approach for generating chorales in the style of Johann Sebastian Bach [4] and it is also the algorithm used by Simon et al. in the SongSmith system.
SongSmith [71] is a system that automatically chooses chords to accompany vocal melodies. By singing into a microphone the user can experiment with different music styles and chord patterns through a user interface designed to be intuitive also for non-musicians. The system is based on an HMM trained on a music database consisting of 298 lead sheets, each consisting of a melody and an associated chord sequence. The HMM represents probability distributions over sequences of observations, allowing it to learn chord transition probabilities. Examining each song in the database the system learns the statistics governing chord transitions in a song by tallying the number of transitions from each chord to every other chord.
Although the HMM has been used for music generation and harmony prediction tasks with some success, probabilistic models like the HMM do have some limitations. Markov models can, at best, reproduce examples they have seen in the training data. In contrast, neural networks such as the Recurrent Neural Network (RNN) are able to learn a fuzzy representation of the data. Such artificial neural nets use their internal representations to perform a high dimensional interpolation between training examples, as noted by Graves [33], and rarely generate the same thing twice.
In A Connectionist Approach To Algorithmic Composition [78], published in 1989, Peter M. Todd describes an RNN approach to generating symbolic music data note by note. Using RNNs has been shown to be somewhat limited for music generation as the RNN shows poor results for long term dependencies due to the problem of vanishing gradients during back propagation [8]. In an attempt to resolve this issue, the Long Short-Term Memory (LSTM) was presented in 1997 by Hochreiter and Schmidhuber [36]. In 2015, Nayebi and Vitelli at Stanford University presented GRUV [53], a music generation system based on recurrent neural nets. The networks take raw audio as input. The input is formatted into a sequence of audio vectors representing the waveform at time t and the system aims to predict the waveform at time t+1. Their dataset consists of two corpora, each containing 20 songs. One contains songs written by the electronic musician Madeon, the other songs by rock musician David Bowie. Two recurrent neural nets were compared in their paper, a GRU, gated recurrent unit (first presented by Cho et al. in 2014 [15]), and an LSTM network. Their findings showed that the audio generated by the LSTM significantly outperformed that of the GRU.
Though the training and validation loss was only slightly lower for the LSTM, the GRU generated audio consisting mostly of white noise, while the LSTM was able to generate musically plausible audio sequences. The image of the generated spectrograms can be found in figure 2.2. It has been included here in order to show the improvements of LSTM networks on long-term dependencies.
JamBot, developed in 2017 [14], uses two LSTM networks: one predicts chord progressions based on a chord embedding similar to the natural language approach of word embeddings used in Google’s word2vec [52], and the other generates polyphonic music for the given chord progression. Their results do exhibit long term structures similar to what one would hear during a jam session. The LSTM unit was used by Schmidhuber and Eck in 2002 to improvise blues music [25], concluding
that the LSTM “successfully learned the global structure of a musical form, and used that information to compose new pieces in the form.” [25, p. 755]
Eck later became a part of the Magenta team at Google Brain.
Figure 2.2: Results reprinted from Nayebi and Vitelli’s paper GRUV: Algorithmic Music Generation using Recurrent Neural Networks [53]
Magenta is a research project devoted to the exploration of the role of machine learning in creating art and music. Recent developments include MusicVAE [67] and NSynth (Neural Synthesizer) [26]. The autoencoder used in NSynth is similar to WaveNet, a network designed at Google DeepMind in 2016. WaveNet [55] uses a fully convolutional network designed to input and output raw audio. As well as showing promising potential for enhancing text-to-speech applications, the researchers have also applied the WaveNet to the task of generating music. The Convolutional Neural Network (CNN) approach is not often used to solve problems that contain long-term dependencies such as music generation. When the network is trained on about 60 hours of solo piano music collected from YouTube the researchers found that enlarging the receptive field of the convolution filters was crucial to obtain samples that sounded musical. Modelling these longer sequences proved to be problematic even with receptive fields lasting several seconds. Nevertheless, the paper notes that the samples produced were often both harmonic and aesthetically pleasing.
With NSynth, Magenta presents both a WaveNet-style autoencoder and a large-scale and high-quality dataset of musical notes. This dataset is an order of magnitude larger than comparable public datasets. In contrast to traditional audio synthesis, NSynth generates sounds using the learned embeddings of the autoencoder. Users can then experiment with different timbres to create new sounds. In collaboration with Google Creative Lab the Open NSynth Super (shown in figure 2.3) was created as an experimental physical interface for the algorithm.
Figure 2.3: Open NSynth Super physical interface for the NSynth algorithm. [43]
Magenta is also the team behind the Performance-RNN [72] which generates expressive timing and dynamics via a stream of MIDI events.
These MIDI events are then transferred to a synthesiser to generate piano sounds. Their network learns melody, timing and velocity using a dataset of MIDI performances created by skilled pianists.
The research presented so far in this section is mostly focused on compositional AI, but as mobile touch screens and cloud-based applications become able to perform more and more advanced tasks we have also seen a rise in machine learning in musical performance and collaboration. Martin and Torresen recently presented RoboJam [48], a machine learning system that assists users of a music app by generating short responses to their improvisations in near real-time. It uses an RNN followed by a Mixture Density network in order to learn touch screen interactions and predict a sequence of control gestures, thus taking the role of a remote collaborator.
Since the system focuses on free-form touch expressions it has the potential for use in different creative apps, with many different musical mappings.
Martin et al. also presented an artificial neural net for ensemble improvisation using a touch screen musical app [47]. The network is trained on a dataset of time-series gesture data collected from 150 performances on a mobile percussion app. The network can generate the simulated performance of one or three other players to accompany a real, lead player, thus creating either a duet, or a quartet. This allows a single performer to generate an ensemble performance.
The work presented in my thesis contributes to the research on popular ML approaches, the LSTM and HMM, for music generation and use in performance, by comparing prediction accuracy on the training data as well as implementing them into the PSCA for real-time improvisation.
Chapter 3
Sequence Classifiers
This chapter presents descriptions of the machine learning algorithms that were compared and implemented into the PSCA system. As the goal for the PSCA is to generate chords using note observations, a classifier is needed that can model the relation between notes and chords as well as the progression of the chords over time.
In this chapter, two sequential machine learning models are discussed: the hidden Markov model (HMM) and the Recurrent Neural Network (RNN).
The RNN and the HMM are often referred to as sequence models, or sequence classifiers, as their goal is to map a series of observations to a series of class labels. These models are used in the PSCA to solve the harmony prediction problem. This task involves learning the dependencies between chords in a chord progression and the relationships between the notes in a melody and the chords that accompany them.
Figure 3.1: The harmony prediction task: Given a set of consecutive measures containing a monophonic melody, select a suitable harmony (qn) for each measure.
3.1 Harmony prediction
The task these models are trained to solve is referred to in this thesis as the harmony prediction problem. At times it will also be referred to as chord prediction, as the harmonies predicted by the models represent triad chords. Moreover, in some sections it has been practical to distinguish between the chords predicted by the models and the harmonies produced when a performer uses the PSCA. The latter are affected by the underlying audio loop as well as the predicted chords. The harmony prediction task can be stated as follows:
Given a sequence of consecutive measures containing a monophonic melody, select a suitable harmony to accompany each measure.
The inputs to the models are the notes contained in a measure. The target value of each measure is a harmony represented by a chord symbol. The task is illustrated in figure 3.1. Here the harmonies are denoted as q and the figure shows how the choice of harmony is affected by the notes in the corresponding measure as well as the preceding harmonies.
3.2 Hidden Markov Model
Hidden Markov models were first presented by Baum et al. in the article A maximisation technique occurring in the statistical analysis of probabilistic functions of Markov chains [7], and have been applied to a range of problems involving time series data [63], most notably perhaps, to problems in speech processing [79], [5]. The HMM is characterised by an underlying process governing an observable sequence. For the harmony prediction problem, the observable sequence is the notes found in each measure, while the process which governs these observable sequences is the underlying chord progression.
Figure 3.2: A Markov model for transitions between four chords. Transition and entry probabilities are omitted.
3.2.1 Markov Chains
In order to describe the structure of the HMM, we begin by describing Markov chains. A Markov chain is a sequence of states that satisfies the Markov assumption, that is, that the probability of moving to any state depends only on the current state. A Markov chain is a sequence which is generated using the following components:
1. A set of $N$ states, $S$:
\[ S = \{s_1, s_2, s_3, s_4, \dots, s_N\} \tag{3.1} \]
2. A transition probability matrix $A$:
\[ A = \{a_{1,1}, a_{1,2}, \dots, a_{N,1}, \dots, a_{N,N}\} \tag{3.2} \]
Every $a_{i,j}$ represents the probability of moving from state $i$ to state $j$, and
\[ \forall i \in \{1, \dots, N\}: \quad \sum_{j=1}^{N} a_{i,j} = 1 \tag{3.3} \]
3. $\pi$, the initial probability distribution over all states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Also,
\[ \sum_{i=1}^{N} \pi_i = 1 \tag{3.4} \]
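The three components listed above can be written down directly as a small data structure. The following sketch uses four chord states, in the spirit of figure 3.2, with invented probabilities chosen purely for illustration. It checks the normalisation constraints of equations 3.3 and 3.4 and computes the probability of a given chord sequence under the Markov assumption.

```python
# States S (the chord names and all probabilities are invented for this example).
STATES = ["C", "F", "G", "Am"]

# Initial distribution pi: probability of starting in each state.
PI = {"C": 0.5, "F": 0.2, "G": 0.2, "Am": 0.1}

# Transition matrix A: A[i][j] is the probability of moving from state i to state j.
A = {
    "C":  {"C": 0.1, "F": 0.4, "G": 0.4, "Am": 0.1},
    "F":  {"C": 0.3, "F": 0.1, "G": 0.5, "Am": 0.1},
    "G":  {"C": 0.6, "F": 0.1, "G": 0.1, "Am": 0.2},
    "Am": {"C": 0.1, "F": 0.5, "G": 0.3, "Am": 0.1},
}


def is_valid_chain(pi, a, tol=1e-9):
    """Check that pi and every row of A sum to one (equations 3.3 and 3.4)."""
    rows_ok = all(abs(sum(row.values()) - 1.0) < tol for row in a.values())
    return rows_ok and abs(sum(pi.values()) - 1.0) < tol


def sequence_probability(sequence, pi, a):
    """Probability of a state sequence: pi for the first state, then one
    transition probability per step, each depending only on the previous state."""
    probability = pi[sequence[0]]
    for previous, current in zip(sequence, sequence[1:]):
        probability *= a[previous][current]
    return probability


PROGRESSION = ["C", "F", "G", "C"]
P_PROGRESSION = sequence_probability(PROGRESSION, PI, A)  # 0.5 * 0.4 * 0.5 * 0.6
```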
3.2.2 Hidden Events
The Markov chain is useful for calculating probabilities for the next event in a sequence when we can observe the sequence directly in the world.
Other times, what we are interested in is not the sequence we can observe, but a “hidden” sequence which governs the resulting observable sequence.
In the task of speech recognition, the observations would be the audio input, and the hidden states would be the words that are contained in the audio. In harmony prediction tasks, the hidden states are the chords, and the durations of the notes in the melody are the observations.
Markov chains with hidden states can be described formally as follows [11]:
1. A set of $N$ states, $S$:
\[ S = \{s_1, s_2, s_3, s_4, \dots, s_N\} \tag{3.5} \]
2. A transition probability matrix $A$:
\[ A = \{a_{1,1}, a_{1,2}, \dots, a_{N,1}, \dots, a_{N,N}\} \tag{3.6} \]
Every $a_{i,j}$ represents the probability of moving from state $i$ to state $j$, and
\[ \forall i \in \{1, \dots, N\}: \quad \sum_{j=1}^{N} a_{i,j} = 1 \tag{3.7} \]
3. $\pi$, the initial probability distribution over all states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Also,
\[ \sum_{i=1}^{N} \pi_i = 1 \tag{3.8} \]
4. The observation space, $O$, the set of unique, observable tokens:
\[ O = \{o_1, o_2, o_3, \dots, o_k\} \tag{3.9} \]
5. An emission probability matrix, $B$:
\[ B = \{b_{1,1}, b_{1,2}, b_{1,3}, \dots, b_{N,k}\} \tag{3.10} \]
Every $b_{i,j}$ represents the probability of observing $o_j$ in state $i$.
6. A sequence of $l$ observations, $Y$:
\[ Y = y_1, y_2, y_3, \dots, y_l \tag{3.11} \]
Each observation $y_i$ contains the observed data at time step $i$.
In a supervised learning scenario (such as harmony prediction), the HMM is learned by building the aforementioned matrices (A, B and π) empirically according to a set of training examples. Training the HMM for the harmony prediction problem consists of calculating the transition (A), start (π) and emission (B) probabilities for each state using the information found in the lead sheets. Unsupervised procedures, which are often used in training HMMs when the hidden processes are unknown, require the use of Expectation Maximisation [22] or similar strategies. In contrast, for the harmony prediction problem presented in my work, the hidden states and the probabilities for transitioning between them can be drawn directly from the data. The matrices are populated directly by counting occurrences while traversing each measure of each lead sheet. This process is described further in the upcoming chapter on the design and implementation of the system (Section 4.3.4).
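The counting procedure can be sketched in a few lines. The example below is a hedged illustration rather than the implementation described in Section 4.3.4: the data layout (each song as a list of measures holding a chord symbol and a mapping from pitch class to total duration), the function names and the two toy songs are assumptions, as is the choice to weight emission counts by note duration.

```python
from collections import Counter, defaultdict


def normalise(counts):
    """Turn a table of counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train_hmm(songs):
    """Estimate pi, A and B by counting over labelled measures."""
    start = Counter()                # how often each chord opens a song
    transitions = defaultdict(Counter)  # chord -> following chord -> count
    emissions = defaultdict(Counter)    # chord -> pitch class -> summed duration
    for song in songs:
        previous = None
        for chord, durations in song:
            if previous is None:
                start[chord] += 1
            else:
                transitions[previous][chord] += 1
            for pitch_class, duration in durations.items():
                emissions[chord][pitch_class] += duration
            previous = chord
    pi = normalise(start)
    a = {chord: normalise(row) for chord, row in transitions.items()}
    b = {chord: normalise(row) for chord, row in emissions.items()}
    return pi, a, b


# Two invented four-beat lead sheet fragments (chord, {pitch class: beats}).
TOY_SONGS = [
    [("C", {"C": 2, "E": 2}), ("F", {"F": 2, "A": 2}), ("G", {"G": 4}), ("C", {"C": 4})],
    [("C", {"E": 4}), ("G", {"G": 2, "B": 2}), ("C", {"C": 4})],
]
PI_HAT, A_HAT, B_HAT = train_hmm(TOY_SONGS)
```

Because the chord labels are given in the training data, no iterative estimation is needed: a single pass over the measures fills all three tables.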
Figure 3.3: HMM chord states and note emissions
3.2.3 Decoding
Decoding is the process of deciding what hidden sequence is most likely for a given sequence of observations. For this task one of the most common decoding algorithms for HMMs, the Viterbi algorithm [29] is used. Given a model, HMM = (A,B,π) and a set of observations, O, the Viterbi algorithm begins by looking at the observations from left to right. For each
observation, the maximum probability for all states is calculated. By keeping a pointer to the previously chosen state the algorithm can backtrace through these pointers to construct the most likely path. Given a set of observations $y = y_1, y_2, \dots, y_t$ as the input, the Viterbi algorithm will return the path of states $x = x_1, x_2, \dots, x_t$ using the following steps:
\[ \mathrm{Viterbi}_{1,N} = P(y_1 \mid N) \cdot \pi_N \tag{3.12a} \]
\[ \mathrm{Viterbi}_{t,N} = \max_{x \in S} \big( P(y_t \mid N) \cdot a_{x,N} \cdot \mathrm{Viterbi}_{t-1,x} \big) \tag{3.12b} \]
Here $N$ denotes a state in $S$, and $a_{x,N}$ is the probability of moving from state $x$ to state $N$.
3.2.4 Applying HMM to Harmony Prediction
Disregarding octave information there are 12 possible notes we can observe. Thereby, an observation can be defined as a vector which represents the 12 notes of the chromatic scale and contains the durations of each note. The emission probabilities represent the relationship between what notes are played and our set of chord states. The emission probability matrix contains note probability distributions for each state.
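To show how the recursion in equation 3.12 selects chords from note observations, the following sketch runs the Viterbi algorithm on a toy model with two chord states. For brevity each measure is reduced to a single observed pitch class rather than the 12-element duration vector described above, and every probability in the model is invented for the example.

```python
def viterbi(observations, states, pi, a, b):
    """Return the most likely state path for a sequence of observations."""
    # Initialisation (3.12a): start probability times emission probability.
    trellis = [{s: pi[s] * b[s][observations[0]] for s in states}]
    pointers = [{}]
    # Recursion (3.12b): best predecessor for every state at every step.
    for t in range(1, len(observations)):
        trellis.append({})
        pointers.append({})
        for s in states:
            best = max(states, key=lambda x: trellis[t - 1][x] * a[x][s])
            trellis[t][s] = trellis[t - 1][best] * a[best][s] * b[s][observations[t]]
            pointers[t][s] = best
    # Backtrace from the most probable final state.
    path = [max(states, key=lambda s: trellis[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(pointers[t][path[-1]])
    path.reverse()
    return path


# Toy model: two chords, five observable pitch classes (all values invented).
CHORDS = ["C", "G"]
START = {"C": 0.6, "G": 0.4}
TRANSITIONS = {"C": {"C": 0.5, "G": 0.5}, "G": {"C": 0.6, "G": 0.4}}
EMISSIONS = {
    "C": {"C": 0.4, "E": 0.3, "G": 0.2, "B": 0.05, "D": 0.05},
    "G": {"C": 0.05, "E": 0.05, "G": 0.3, "B": 0.3, "D": 0.3},
}

MELODY = ["C", "B", "D", "E"]  # dominant pitch class of four consecutive measures
CHORD_PATH = viterbi(MELODY, CHORDS, START, TRANSITIONS, EMISSIONS)
# CHORD_PATH is ["C", "G", "G", "C"] for this toy model.
```

For long sequences the products become very small, so a practical implementation would typically sum log probabilities instead of multiplying probabilities.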
3.3 Recurrent Neural Nets
The recurrent neural net is a type of artificial neural net (ANN), a system which contains interconnected layers of nodes, or neurons. Each node has an input where signals from the previous layer of nodes can enter, and an output where the signal from the current node can travel to the next layer of nodes. The strength of each node’s signal is regulated by its internal weights. Learning is achieved by updating these weights by feeding the input forwards and then propagating the error of the output layer backwards through the network, from the output layer to the input layer. Such ANNs are called feedforward nets. The internal weights of each node are updated individually, meaning the weights of one node are not directly affected by the weights of any other nodes. In a way the RNN can be thought of as a feedforward net which is unrolled over time where the weights are shared across time steps. The RNN applies a recurrent connection, which allows the state of the previous time step to affect the current time step.
When humans take in information, for example when watching a film, we process the audio and video bit by bit from start to end. We do not forget what we have seen or heard as soon as it has been processed by our brains; we hold the information in our memory in order to understand the story. This is the underlying idea behind RNNs. As with HMMs, RNNs have also shown good results when used in other time series tasks such as speech recognition, as demonstrated by Graves et al. in 2014 [34].
Figure 3.4: RNN
The input to a simple RNN has the form (time_step, input_features).
The RNN internally loops over the input features of each time step while maintaining an internal state. In each loop the RNN considers the state at this time step as well as the input features at this time step when calculating its output. The output then becomes the new state for the next time step.
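The loop described above can be sketched with a single recurrent unit. This is a minimal illustration and not the network used in the PSCA: the tanh activation, the single hidden unit and the weight names are assumptions made to keep the example short.

```python
import math


def rnn_forward(sequence, w_in, w_rec, bias):
    """Run a one-unit RNN over a sequence of shape (time_step, input_features).

    At every time step the new state is computed from the current input
    features and the state left behind by the previous time step.
    """
    state = 0.0  # the internal state is reset at the start of each sequence
    outputs = []
    for features in sequence:
        weighted_input = sum(w * x for w, x in zip(w_in, features))
        state = math.tanh(weighted_input + w_rec * state + bias)
        outputs.append(state)  # the output becomes the state for the next step
    return outputs


# Two time steps with two input features each (values invented).
SEQUENCE = [[1.0, 0.0], [0.0, 0.0]]
WITH_MEMORY = rnn_forward(SEQUENCE, w_in=[1.0, 0.0], w_rec=1.0, bias=0.0)
WITHOUT_MEMORY = rnn_forward(SEQUENCE, w_in=[1.0, 0.0], w_rec=0.0, bias=0.0)
```

With the recurrent weight set to zero the second output carries no trace of the first input, whereas a non-zero recurrent weight lets the first time step influence the second.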
3.3.1 Learning
This way the RNN can learn temporal dependencies from the start of the sequence till the end. More specifically the RNN is trained using BPTT, “Back Propagation Through Time” or RTRL, “Real-Time Recurrent Learning”. At the start of every new sequence the internal state of the RNN is reset, hence the RNN does not retain any information between sequences, only between time steps in each sequence.
Figure 3.5: RNN unrolled over time
3.3.2 LSTM: Long Short-Term Memory
Even though the RNN can, in theory, learn time dependencies found in the input sequence, there are still problems with learning long-term dependencies. As the number of time steps in a sequence increases, the RNN starts to “forget” what it has learned from the first time steps. This is due to the back propagation training procedure producing vanishing or
exploding gradients. This phenomenon was examined by Hochreiter et al. in their article Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies [37] as well as by Bengio et al. in Learning Long-Term Dependencies with Gradient Descent is Difficult [8]. Long Short-Term Memory (LSTM) is designed to address exactly these issues.
Proposed in 1997 by Hochreiter and Schmidhuber [36], LSTM uses a so-called CEC (“Constant Error Carousel”) to carry information across many time steps. This prevents the information from earlier time steps from vanishing. When looking at the LSTM unrolled over time (shown in figure 3.6) one can imagine the CEC as a conveyor belt carrying information across time steps.
Figure 3.6: LSTM cell unrolled over time
Since LSTM allows information to be stored across arbitrary time steps, it is also important to forget information which is no longer relevant. Otherwise, the states of a cell may grow in an unbounded way. In the report Learning to Forget: Continual Prediction with LSTM, Gers, Schmidhuber and Cummins describe a solution to the problem of unbounded growth in cell states, the Forget Gate. The Forget Gate learns to reset the cell memory when the information in it has become useless [32]. In general, the LSTM cell consists of three multiplicative gates (see figure 3.7): the input, output and forget gates.
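The interplay of the three gates can be illustrated with a single scalar LSTM cell. The sketch below follows the commonly used formulation of the LSTM with a forget gate. The weight layout, the function names and the extreme bias values are assumptions chosen so that the gates are almost fully open or closed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One time step of a scalar LSTM cell.

    `weights` maps each of "input", "forget", "output" and "candidate"
    to a tuple (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    input_gate = sigmoid(pre_activation("input"))     # how much new information to write
    forget_gate = sigmoid(pre_activation("forget"))   # how much old memory to keep
    output_gate = sigmoid(pre_activation("output"))   # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))

    cell = forget_gate * c_prev + input_gate * candidate  # the carried cell state
    hidden = output_gate * math.tanh(cell)
    return hidden, cell


# Forget gate open, input gate closed: the stored value is carried forward.
KEEP = {"input": (0, 0, -20), "forget": (0, 0, 20), "output": (0, 0, 20), "candidate": (0, 0, 0)}
# Forget gate closed: the stored value is wiped.
RESET = {"input": (0, 0, -20), "forget": (0, 0, -20), "output": (0, 0, 20), "candidate": (0, 0, 0)}

_, KEPT_CELL = lstm_step(1.0, 0.0, 5.0, KEEP)
_, RESET_CELL = lstm_step(1.0, 0.0, 5.0, RESET)
```

In a trained network the gate activations are of course learned from data and depend on the input, rather than being pinned by large biases as in this toy setup.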
3.3.3 Bidirectional RNN
The bidirectional RNN (BRNN) [70], as the name implies, combines an RNN which moves through the input sequence forwards in time with an RNN which moves through the input backwards in time. Thereby, bidirectional networks can learn context in both directions.
Separate nodes handle information forwards and backwards through the network. Thus, the output at time t can utilise a summary of what has been learned from the beginning of the sequence, forwards till t, as well as what is learned from the end of the sequence, backwards till t. Bidirectional LSTM was presented by Graves and Schmidhuber in 2005 [35] and is arguably a better choice than RNN or LSTM for the PSCA, as the current chord at some time step is affected both by the preceding chord and the next chord. The Bidirectional Long Short-Term Memory (BLSTM) can learn both these contexts.
Figure 3.7: Figure showing the three gates of a single LSTM cell.
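The idea of combining a forward and a backward pass can be sketched with a toy example. The code below is illustrative only and assumes a one-unit tanh RNN in place of the LSTM cells; the function names and the two recurrent weights are hypothetical. Each pass has its own weight, mirroring the separate forward and backward nodes described above.

```python
import math


def run_rnn(inputs, weight):
    """Toy one-unit RNN, h_t = tanh(weight * h_{t-1} + x_t).
    The state is reset to zero at the start of the sequence."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(weight * h + x)
        states.append(h)
    return states


def run_bidirectional(inputs, forward_weight=0.5, backward_weight=0.3):
    """Pair, for every time step, a summary of the past with a summary of the future."""
    forward = run_rnn(inputs, forward_weight)
    # Process the reversed sequence, then flip the states back into time order.
    backward = run_rnn(inputs[::-1], backward_weight)[::-1]
    return list(zip(forward, backward))


outputs = run_bidirectional([1.0, 0.0, 0.0])
```

In this sketch the forward state at the first time step depends only on the first input, while the backward state at the same time step changes whenever a later input changes. For harmony prediction this corresponds to a chord prediction being informed by the measures that follow as well as those that precede it.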
3.3.4 Applying BLSTM for Harmony Prediction
When training the BLSTM for harmony prediction, the measures found in the lead sheets represent the time steps. The dataset is structured into sequences of eight measures. The first sample of the dataset consists of the first eight measures, measures 1 to 8. For the next sample we move one step to the right and create a sequence which contains measures 2 to 9, and so on. Each measure consists of a chord symbol (represented using one-hot encoded matrices) and an associated note vector with 12 elements corresponding to the duration of each of the 12 possible notes. The input to the BLSTM thus has the form (number_of_sequences, sequence_length, note_vector) and the output has the form (number_of_sequences, sequence_length, one_hot_encoded_targets).
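The windowing described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the PSCA code: the function names, the integer chord indices and the num_chords parameter are assumptions, and the sketch treats a single song at a time.

```python
def one_hot(index, size):
    """Return a list of length size with a 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def build_dataset(note_vectors, chord_indices, num_chords, sequence_length=8):
    """Slide a window of sequence_length measures over one song, one measure at a time.

    note_vectors:  one 12-element list of note durations per measure
    chord_indices: one integer chord label per measure (hypothetical encoding)
    """
    inputs, targets = [], []
    last_start = len(note_vectors) - sequence_length
    for start in range(last_start + 1):
        stop = start + sequence_length
        inputs.append(note_vectors[start:stop])
        targets.append([one_hot(c, num_chords) for c in chord_indices[start:stop]])
    return inputs, targets


# Ten measures of made-up data: measure m has a duration of 1.0 on pitch class m.
song_notes = [[1.0 if pitch == m else 0.0 for pitch in range(12)] for m in range(10)]
song_chords = [m % 4 for m in range(10)]
inputs, targets = build_dataset(song_notes, song_chords, num_chords=4)
```

A song of ten measures yields three overlapping sequences (measures 1 to 8, 2 to 9 and 3 to 10), so the nested lists correspond to the shapes (3, 8, 12) for the input and (3, 8, 4) for the targets in this made-up example.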
In the upcoming chapter, section 4.3.5, the implementation of the BLSTM is described in further detail. In figure 4.11 the code for building the BLSTM model is shown.
Chapter 4
Design and Implementation
This chapter gives an overview of the work done to develop the PSCA.
This includes the implementation of an audio looper written in Python, descriptions of pitch detection and chord construction as well as details of the dataset and the implementation of the machine learning approaches.
4.1 The audio looper
(a) Python looper implementation (b) Arduino and foot switches
Figure 4.1: The PSCA system consists of software and hardware components. The Python scripts for running the PSCA system are available at https://github.com/benediktewallace/PSCA. Photos: Benedikte Wallace
The underlying system developed for the PSCA is an audio looper written in Python that is controlled by a pair of foot switches through an Arduino. When the program is running the user sings into a microphone and controls the system using the foot switches. One switch controls playback on or off, the other controls recording on or off. The user can also clear the playback loop by pressing a small button to start over. During