Beamforming for condition monitoring

Nikolai West Orheim

Thesis submitted for the degree of Master in Electrical Engineering (Signal Processing)

Department of Physics
Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO



© 2017 Nikolai West Orheim

Beamforming for condition monitoring http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

Acoustic condition monitoring is an important technology for non-invasive monitoring of machinery, but combining robust array signal processing techniques with machine learning is still not a fully explored field.

This thesis examines how different microphone arrays and beamforming algorithms affect features commonly used in audio classification. We look at three separate arrays: one uniform rectangular array and two uniform circular arrays with different diameters and numbers of microphone elements.

Additionally, we examine the conventional delay and sum (DAS) beamformer, as well as two adaptive beamformers: a broadband minimum variance distortionless response (MVDR) beamformer using the discrete Fourier transform (DFT), and the generalized sidelobe canceller (GSC) using the normalized least mean squares (NLMS) approach.

We show that the size of the array and the number of microphones have a significant impact on its performance in terms of directivity and white noise gain. The positions of the microphone elements also play an important role in avoiding spatial aliasing. The larger circular array outperforms both of the other arrays, but the smaller arrays have the upside of being easily maneuverable and can be placed in smaller, confined spaces. The rectangular array exhibits some spatial aliasing for the higher frequencies in the audible range, but as long as the frequency range of the source signal can be ensured to be band-limited to its non-aliasing spectrum, it performs significantly better than the smaller circular array.

The three beamformers all perform well when the steering is correct, but the adaptive beamformers quickly attenuate the source signal when there is a steering mismatch. The NLMS based GSC beamformer would likely cause problems for a batch based machine learning algorithm due to the continuous updating of its weights. The DFT based MVDR beamformer has issues with phase-shift errors if the number of buffered samples is too low, but as long as a sufficient number of samples is used, it performs better than the GSC in every scenario tested. This implies that the MVDR is the preferred beamformer of the two if high directivity is needed. DAS is the safest choice of the beamformers, as there are no parameters to tune for a given environment, and it performs well in every tested setting.


Contents

1 Introduction 1

1.1 History and Introduction to Concepts . . . 1

1.1.1 Condition Monitoring . . . 1

1.1.2 Beamforming . . . 1

1.1.3 Machine Learning . . . 2

1.2 Motivation and Goals . . . 3

1.3 Outline of the Thesis . . . 4

2 Background Theory 8

2.1 Propagating sound waves . . . 8

2.1.1 Wave equation . . . 9

2.1.2 Decibel Scale . . . 10

2.1.3 Near-field and Far-field . . . 11

2.2 Signal classification . . . 14

2.2.1 Amplitude/Power . . . 14

2.2.2 Frequency spectrum . . . 15

2.2.3 Spectrogram . . . 16

2.2.4 Linear Frequency Cepstral Coefficients (LFCCs) . . . 17

2.3 Beamforming . . . 21

2.3.1 Arrays . . . 21

2.3.2 Delay and Sum (DAS) . . . 22

2.3.3 Minimum Variance Distortionless Response (MVDR) . . . 24

2.3.4 Generalized Sidelobe Canceller (GSC) . . . 28

2.3.5 Comparison . . . 33

3 Simulations 39

3.1 Arrays . . . 39

3.2 Narrowband Interference . . . 44

3.3 Broadband Interference . . . 46

4 Experiments 50

4.1 Equipment and Experiment Layout . . . 50

4.2 Array Processing Characteristics . . . 54

4.3 Condition Monitoring Scenario . . . 57

5 Results and Discussion 58

5.1 Array and Beamformer Characteristics . . . 58

5.1.1 Broadband Interference . . . 58

5.1.2 Narrowband Interference . . . 63

5.2 Condition Monitoring Scenario . . . 69

6 Conclusion and Future Work 77

6.1 Conclusion . . . 77


6.2 Future Work . . . 78


List of Figures

1.1 Layout of condition monitoring . . . 1

1.2 Illustration of beamforming scenario . . . 2

2.1 Spherical coordinate system . . . 8

2.2 Propagation of a spherical wave . . . 10

2.3 Spherical wave, near-field . . . 11

2.4 Plane wave, far-field . . . 12

2.5 Near-field, far-field limit . . . 12

2.6 Waveform of faulty car engine . . . 14

2.7 RMSE of faulty car engine . . . 15

2.8 FFT of faulty car engine . . . 16

2.9 Spectrogram of faulty car engine . . . 17

2.10 Linear vs Mel scale . . . 18

2.11 LFCCs of faulty car engine . . . 19

2.12 LFCC-∆s of faulty car engine . . . 20

2.13 Aperture smoothing function for a ULA with 10 sensors . . 22

2.14 Delay, weight and sum beamformer. . . 23

2.15 MVDR vs DAS weights for 10 element ULA . . . 25

2.16 Block diagram for broadband DFT MVDR beamforming . . 27

2.17 Block diagram of the GSC . . . 29

2.18 Diagram of LMS based GSC . . . 30

2.19 Spatial response of blocking matrices . . . 32

2.20 Typical array pattern . . . 37

2.21 Beamforming spectrums with DAS and MVDR . . . 38

3.1 Layout of 2D microphone arrays . . . 39

3.2 Beampattern in the UV space for the arrays . . . 40

3.3 Contour of frequency response of arrays . . . 41

3.4 Beampattern with peak-to-zero width shown. . . 42

3.5 Peak-to-zero width for arrays . . . 43

3.6 Maximum sidelobe levels for arrays . . . 43

3.7 Comparison of simulated beamforming with chirp and narrowband interference . . . 44

3.8 Cross-section from the STFTs of beamformed output steered towards chirp . . . 45

3.9 Total power of beamformed signal for WGN source with moving interference . . . 47

3.10 Correlation coefficient of beamformed signal for WGN source with moving interference . . . 48

3.11 Total power of the beamformed signal for WGN source with increasing interference . . . 49

3.12 Correlation coefficient of beamformed signal for WGN source with increasing interference . . . 49

4.1 Photograph of 128- and 256-element UCA . . . 51

4.2 Photograph of 256-element URA . . . 51


4.3 Frequency response of INMP421 . . . 52

4.4 Frequency response of INMP521 . . . 52

4.5 Experiments layout . . . 53

4.6 Setup of 256-element UCA in the anechoic room . . . 54

4.7 View from the array during testing in anechoic room, with superimposed intensity plot. . . 55

4.8 View from the array during testing in a normal office environment, with superimposed intensity plot . . . 56

4.9 Image of model tank car placed in the sound room . . . 57

5.1 Moving interference array comparison in anechoic room . . 59

5.2 Moving interference array comparison in office environment 60

5.3 Mean power of beamformed signal with moving interference in anechoic room . . . 61

5.4 Mean power of beamformed signal with moving interference in office environment . . . 62

5.5 Maximum correlation between beamformed signal and present sounds using broadband MVDR . . . 63

5.6 Maximum correlation between beamformed signal and present sounds using GSC . . . 64

5.7 Comparison of blocking matrix output . . . 65

5.8 Beampatterns of the sum of the rows for the blocking matrices 65

5.9 Spectrogram of 256 UCA steered towards chirp in anechoic room . . . 66

5.10 Cross-section from spectrograms from Figure 5.9 . . . 67

5.11 Spectrogram of 256 UCA steered towards chirp in office environment . . . 67

5.12 Cross section from spectrograms from Figure 5.11 . . . 68

5.13 Spectrogram of sounds from tank car from single microphone 69

5.14 Fourier transform of sounds from tank car . . . 70

5.15 RMSE of the normal and error state of the tank car . . . 71

5.16 Mean of all LFCCs calculated in the sound sample . . . 72

5.17 IDCT of mean of all LFCCs calculated in the sound sample . 73

5.18 Array view of tank car in anechoic room . . . 74

5.19 Tank car beamformed FFT comparison . . . 75

5.20 Tank car beamformed RMSE comparison . . . 76

List of Tables

5.1 Beamformer comparison in anechoic room . . . 61

5.2 Beamformer comparison in office environment . . . 62


Acknowledgment

First of all, I would like to thank Squarehead Technology for providing me with the opportunity to write this master thesis, for letting me use their equipment whenever it was needed, and for being a great place to work.

I would also like to thank my supervisors, Jørgen Grythe, Ines Hafizovic, Sverre Holm, and Jon Petter Åsen for all their input, and for the help and guidance they have given me during this thesis. Also a big thank you to Thomas Ballo for four and a half years of studying together, and for the countless hours and long days of thesis writing and exam preparation.

A big thank you to Marie Totland, for proofreading this thesis.


Chapter 1

Introduction

1.1 History and Introduction to Concepts

1.1.1 Condition Monitoring

Condition monitoring is the process of monitoring and determining the condition of machinery while it is operating. By doing this, maintenance can be scheduled in case of a failure before any significant damage is done to the system. Different methods of condition monitoring have been used on machinery since the 1950s, as machines became more intricate and complicated to fix. Traditionally, vibration-based condition monitoring has been the most widespread method of automatic detection, but other methods such as infrared, acoustic and electrical monitoring, to name a few, are also being used. Acoustic condition monitoring is suitable for any application where a machine running in normal condition has a stable noise spectrum, and a fault in this machinery causes the noise spectrum to change.

The process of automatic condition monitoring can roughly be split up into three steps: data acquisition, data processing and fault diagnostics and prediction, as in Figure 1.1. This thesis will focus on the data acquisition and the processing of data by using different beamforming techniques.

Figure 1.1: Layout of condition monitoring

1.1.2 Beamforming

Beamforming is a signal processing technique where several sensors are used for directional transmission or reception of a signal. It has a long history and has been studied in many areas such as radar, radio astronomy, sonar, communications, and seismology. As acoustic waves are the center of attention of this thesis, no focus will be given to beamforming of other types of waves or to beamforming on the transmitting side. Recording of data will be done using arrays of microphones capturing sound within the audible range, i.e., sounds with frequencies up to approximately 20 kHz.


The concept behind beamforming is to introduce a delay at the microphones’ inputs to “steer” the focus of the beam in the direction of interest.

A scenario like this is illustrated in Figure 1.2, where we have a 10 element Uniform Linear Array (ULA) along the x-axis, and a source transmitting at an angle θ from the normal incidence angle (θ = 0).

Figure 1.2: Illustration of beamforming scenario

The “correct” delay on each element can be calculated from the propagation speed of the wave, the angle θ and the microphones’ positions (this is explored further in section 2.3). The signals from the channels can then be combined, which causes the Signal of Interest (SOI) to be summed constructively, while signals and noise from other directions are summed destructively and suppressed.

1.1.3 Machine Learning

Machine learning is a subfield of computer science focused on computer algorithms that learn from the data and information they are given. It is closely connected to computational statistics, and the outcome of the algorithms is based on the statistical properties of the data being analyzed. The field of machine learning has been around since the 1950s but has seen a recent surge of interest due to the development of neural networks, as well as increased computational power.

There are several different approaches to machine learning, which can be categorized into three different classes:

• Supervised learning: when the computer is presented with training inputs as well as their desired outputs by a supervisor. The goal of these algorithms is to teach the machine general rules that map inputs to outputs.

• Unsupervised learning: when no labels or expected outputs are given to the algorithm, and it is left to itself to figure out a structure in the input.


• Reinforcement learning: similar to unsupervised learning in the way that no correct input/output pairs are presented, but has an additional step where it is provided feedback in terms of reward or punishment as it navigates the problem space.

1.2 Motivation and Goals

The standard way of performing preventive maintenance is planned maintenance, where the machine or parts of interest are checked or replaced at fixed time intervals, for instance after a certain number of hours or years. The disadvantage of this approach is that the user may either change equipment or parts before it is due, or fail to detect faulty machinery before a failure occurs, leading to downtime and expensive repairs. Condition monitoring and predictive maintenance, on the other hand, involve the continuous assessment of the condition of a plant or machinery while it is running, with the goal of detecting errors or flaws in the equipment before failure occurs. Typical assessment technologies include thermography, corrosion monitoring, lubricating oil analysis, motor current signature analysis, vibration monitoring, and ultrasound and acoustic emission monitoring.

With acoustic condition monitoring, a machine running in a healthy condition has a known and stable noise spectrum. If this spectrum deviates from one or several predefined healthy states, this may be an indication of potential errors in the machine. Acoustic emission monitoring is also a non-invasive technology that can be used while the equipment is operating, giving instantaneous diagnosis and feedback.

By using microphone arrays to record the acoustic signals as opposed to a single microphone, it is possible to single out a specific listening direction, or several listening directions simultaneously, while attenuating surrounding noise. This, in turn, makes it possible to listen to several selected parts of the machinery with one microphone array, which can be placed at an appropriate distance from the machinery in question. Acoustic condition monitoring with microphone arrays in noisy locations, such as plants and factory environments, has the potential to become a robust monitoring and diagnostic tool. Combining array processing algorithms with classification strategies and machine learning is still not a fully explored field, and new insights on how to optimally unite these for robust condition monitoring are needed.

We wish to gain insight into the relationship between the beamformer performance and the performance of a selected classifier. We want to explore this in the context of condition monitoring, limited to a few representative monitoring and diagnostics scenarios and to a selection of representative fixed and adaptive beamformers. The aim is to develop an evaluation model regarding standard beamformer metrics such as resolution, directivity, white noise gain, aperture size, etc., in the hope of being able to prescribe the “optimal” array-beamformer combination for a set of audio features commonly used in audio classification.

1.3 Outline of the Thesis

Chapter 2 contains all the background theory used to develop and understand the algorithms used in the rest of the thesis.

Chapter 3 is a collection of different simulations to illustrate the different beamformers and their properties, as well as more complex scenarios simulating real-life situations.

Chapter 4 contains the experiments done throughout the thesis and an overview of how they were performed.

Chapter 5 compares the performance of the different beamformers from a condition monitoring perspective.

Chapter 6 concludes the most significant results and suggests future work that might be of interest.


Acronyms

DAS Delay and Sum . . . 22

DCT Discrete Cosine Transform . . . 17

DFT Discrete Fourier Transform . . . 21

DOA Direction of Arrival . . . 38

FIR Finite Impulse Response . . . 30

FWHM Full Width, Half-Maximum . . . 37

GSC Generalized Sidelobe Canceller . . . 28

HPBW Half-Power Beamwidth . . . 37

IDCT Inverse Discrete Cosine Transform . . . 72

IDFT Inverse Discrete Fourier Transform . . . 26

LCMV Linearly Constrained Minimum Variance . . . 28

LFCC Linear Frequency Cepstral Coefficient . . . 17

LMS Least Mean Squares . . . 31

MFCC Mel-Frequency Cepstral Coefficient . . . 16

MVDR Minimum Variance Distortionless Response . . . 24

NLMS Normalized Least Mean Squares . . . 33

RMSE Root Mean Square Energy . . . 14

ROC Receiver Operating Characteristic . . . 47

RPM Revolutions per Minute . . . 15

SNR Signal-to-Noise Ratio . . . 21

SOI Signal of Interest . . . 2

SPL Sound Pressure Level . . . 11

STFT Short-Time Fourier Transform . . . 16

UCA Uniform Circular Array . . . 39

ULA Uniform Linear Array . . . 2

URA Uniform Rectangular Array . . . 39

WGN White Gaussian Noise . . . 46


Mathematical Notation

A   Matrix; bold, uppercase
α   Slowness vector
α   NLMS algorithm step size
b   Vector; bold, lowercase
B   Blocking matrix
B(·)   Beampattern
c   Speed of sound [m/s]
C   Constraint matrix
(·)^*   Conjugate
(·)^H   Conjugate transpose
(x, y, z)   Cartesian coordinates
D   Aperture size [m]
δ   Diagonal loading parameter
δ(·)   Discrete impulse function
∆_m   Time delay on sensor m
∂   Partial derivative
e   Steering vector
E(·)   Expected value
f   Frequency [Hz]
f_s   Sampling frequency
g   Constraint value vector
G   Gain
I   Identity matrix
k   Spatial angular frequency
L   LMS algorithm filter length
λ   Wavelength
M   Number of array sensors
µ   LMS algorithm step size
N   Number of samples
∇²   Laplace operator
P   Signal power
P(·)   Power
φ   Azimuth angle
ψ   Wavefield
r   Distance from origin
r_m   Distance from sensor m
R   Spatial correlation matrix
R̂   Sample correlation matrix
s   Source signal
σ   Standard deviation
t   Time [s]
θ   Elevation angle
(·)^T   Transpose
w   Vector of sensor weights
w_m   Weight on sensor m
W   Matrix of sensor weights
W(k)   Aperture smoothing function
ω   Temporal angular frequency
ξ   Wave propagation direction
x_m   Signal on sensor m
X   Recorded signal matrix
y   Beamformed signal


Chapter 2

Background Theory

2.1 Propagating sound waves

The coordinate system used in all of the equations in this chapter is shown in Figure 2.1. We have a three-dimensional coordinate system, where θ is the elevation angle — also called the normal incidence angle, φ is the azimuth angle in the x-y plane and r is the distance from the origin. For the two-dimensional case, when using a 1D array, the coordinate system shown in Figure 1.2 is used.

Figure 2.1: Spherical coordinate system

The relationship between the Cartesian and spherical coordinates shown in Figure 2.1 is given by

\[
x = r\sin\theta\cos\phi, \quad y = r\sin\theta\sin\phi, \quad z = r\cos\theta. \tag{2.1}
\]

2.1.1 Wave equation

To get a deeper understanding of the physics behind beamforming, we start with the wave equation, which describes how sound waves propagate in air.

The simplest form of the wave equation is the scalar linear homogeneous wave equation. This is given by

\[
\nabla^2 \psi(\mathbf{x},t) - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\psi(\mathbf{x},t) = 0, \tag{2.2}
\]

where x = (x, y, z), c is the speed of the wave, and ψ(x, t) is the three-dimensional wavefield. ∇² is the Laplace operator, defined as

\[
\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}. \tag{2.3}
\]

The simplest solution to this wave equation is the monochromatic plane wave. This is a wave with a single frequency f and a direction of propagation k. This complex wavefield is described as

\[
\psi(\mathbf{x},t) = A\exp\!\left(j(\omega t - \mathbf{k}\cdot\mathbf{x})\right), \tag{2.4}
\]

where ω = 2πf. If we substitute this back into the wave equation, we get

\[
\begin{aligned}
\nabla^2 \exp\!\left(j(\omega t - \mathbf{k}\cdot\mathbf{x})\right) - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\exp\!\left(j(\omega t - \mathbf{k}\cdot\mathbf{x})\right) &= 0, \\
(k_x^2 + k_y^2 + k_z^2)\exp\!\left(j(\omega t - \mathbf{k}\cdot\mathbf{x})\right) - \frac{\omega^2}{c^2}\exp\!\left(j(\omega t - \mathbf{k}\cdot\mathbf{x})\right) &= 0,
\end{aligned} \tag{2.5}
\]

and we see that as long as the relationship

\[
k_x^2 + k_y^2 + k_z^2 = \frac{\omega^2}{c^2} \tag{2.6}
\]

holds, this is a valid solution to the wave equation. This leads to the relationships k = ω/c (the dispersion relation) and λ = c/f.

The wave equation above can also be expressed in the spherical coordinates (r, θ, φ) as the general spherical wave equation, where s = s(r, θ, φ, t) is the wavefield:

\[
\frac{1}{r^2}\frac{\partial}{\partial r}\!\left(r^2\frac{\partial s}{\partial r}\right)
+ \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\!\left(\sin\theta\,\frac{\partial s}{\partial\theta}\right)
+ \frac{1}{r^2\sin^2\theta}\frac{\partial^2 s}{\partial\phi^2}
- \frac{1}{c^2}\frac{\partial^2 s}{\partial t^2} = 0. \tag{2.7}
\]

The spherical wave equation is usually used in situations where we have spherical symmetry, e.g., if we have a point source and the wave radiates in a spherical pattern from that source. An example of this spherical wave propagation is shown in Figure 2.2. In this case, s(r, θ, φ, t) only depends on r and t. The simplified spherical wave equation then becomes

\[
\frac{1}{r^2}\frac{\partial}{\partial r}\!\left(r^2\frac{\partial s}{\partial r}\right) - \frac{1}{c^2}\frac{\partial^2 s}{\partial t^2} = 0. \tag{2.8}
\]

The most straightforward solution to this, just like the solution for the Cartesian wave equation, is the monochromatic one

\[
s(r,t) = \frac{A}{r}\exp\!\left(j(\omega t - kr)\right). \tag{2.9}
\]

From the dispersion relation [1] we know that k = ω/c, so Equation 2.9 can be simplified as

\[
s(r,t) = \frac{A}{r}\exp\!\left(j(\omega t - kr)\right)
= \frac{A}{r}\exp\!\left(j\!\left(\omega t - \frac{\omega}{c}r\right)\right)
= \frac{A}{r}\exp\!\left(j\omega\!\left(t - \frac{r}{c}\right)\right)
= \frac{s(t - r/c)}{r}, \tag{2.10}
\]

where s(t) = A exp(jωt) is the source signal.

Figure 2.2: Propagation of a spherical wave

2.1.2 Decibel Scale

The decibel scale is commonly used as a measure of signal loudness and is a logarithmic unit used to determine the ratio between two values. The formula for decibels is given by

\[
\beta = 10\log_{10}\frac{P}{P_{ref}}, \tag{2.11}
\]

where P is the power of the measured signal, and P_ref is the reference value.

When calculating the Sound Pressure Level (SPL) for audio loudness, the reference value is given by P_ref = 10⁻¹² W/m², the lower threshold for human hearing. Throughout this thesis, a reference value of P_ref = 1 will be used to avoid applying new reference values for the different calculations.

2.1.3 Near-field and Far-field

The two different types of waves described in subsection 2.1.1 create two slightly different problems when it comes to beamforming. If the source is close to the array, in the near-field, the wave at the microphones is spherical. If the source is sufficiently far away, this spherical wave appears as a plane wave.

From Figure 2.3 we can see that when a spherical wave approaches our array, we have to delay each element differently to compensate for the spherical shape of the arriving wave. The delay directions are illustrated with dotted lines from each of the microphone positions.

Figure 2.3: Spherical wave, near-field

When we have a plane wave as in Figure 2.4, the delays for the sensors become a function of the incoming angle instead of source position.

The two cases can be separated into near-field (spherical wave) beamforming and far-field (plane-wave) beamforming, which will be further discussed in section 2.3.

Figure 2.4: Plane wave, far-field

The transition from a spherical to a seemingly plane wave is gradual, so there is no firm limit on when a wave from a point source can be said to be plane. Because of this, several different limits d have been established:

• d = D²/λ (Fresnel limit)

• d = (π/4) D²/λ (diffraction limit)

• d = 2D²/λ (Fraunhofer distance),

where D denotes the array size and λ is the wavelength as before. We see that the common factor in all of these limits is D²/λ; the rest is just a scaling constant. In Figure 2.5 this common factor is shown for the case where we have a unit-sized, D = λ, array.

Figure 2.5: Near-field, far-field limit

In our case, we will be using arrays with up to 1 meter in diameter. If we consider a 1 kHz wave, we get a far-field limit of

\[
\frac{D^2}{\lambda} = \frac{(1\ \mathrm{m})^2}{(340\ \mathrm{m/s})/(1\ \mathrm{kHz})} \approx 3\ \mathrm{m},
\]

which is a reasonable distance for our arrays to be placed away from the source. We are, however, sampling at f_s = 44100 Hz, so our worst-case wavelength is λ_min = c/(f_s/2) = 340/22050 ≈ 0.015 m. This leads to a far-field limit of

\[
\frac{D^2}{\lambda_{min}} = \frac{1}{0.015} \approx 67\ \mathrm{m},
\]

which means we will see a gradually increasing error for all frequencies above 1 kHz if an array is placed at this distance and plane waves are assumed. Because of this, all waves are presumed to be spherical when calculating the time-delays at each sensor. This also has the beneficial effect of giving correct delays for plane waves, as the spherical wave delays gradually transition towards the plane wave delays when the distance between source and receiver becomes large.
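As a minimal sketch of this kind of calculation (assuming a speed of sound of 340 m/s; the function name and example values are illustrative only), the common D²/λ factor can be computed for a given array size and frequency as follows:

```python
# Sketch: far-field limit factor D^2 / lambda for a given array size and frequency.
# Assumes a speed of sound of 340 m/s, as used elsewhere in this chapter.

def far_field_limit(D, f, c=340.0):
    """Return the common far-field factor D^2 / lambda in metres."""
    wavelength = c / f
    return D ** 2 / wavelength

if __name__ == "__main__":
    D = 1.0  # array diameter [m]
    print(far_field_limit(D, 1_000))    # ~3 m at 1 kHz
    print(far_field_limit(D, 22_050))   # ~65 m at the Nyquist frequency (fs = 44.1 kHz)
```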


2.2 Signal classification

Audio classification is a two-step process, where the first step is to extract relevant features from the sound to reduce the amount of data to be processed. These features are then fed to a machine learning algorithm which classifies the data.

There is a big assortment of different features to choose from when classifying audio. This section will try to cover the most relevant ones when working with classifying arbitrary sounds. The selection of the most used features is often based on speech and music properties, which may lead to a limited classification performance for sound in general. Because of this, we will try to focus on features that are applicable in any audio classification scenario. As an illustration of what each of the following features provides in a machine learning setting, the short clip of a faulty car engine shown in Figure 2.6 will be used, containing a faulty noise starting after about 1.5 seconds. The feature extraction has been done using the Python Librosa library [2].

Figure 2.6: Waveform of faulty car engine

2.2.1 Amplitude/Power

The most elementary form of detection would be to chart signal changes in the time domain, such as a spike in received energy, like the one we can see in Figure 2.6. One way to do this is to calculate the Root Mean Square Energy (RMSE) of a windowed portion of the signal, and do this for each consecutive frame in the signal. The RMSE of a signal is calculated as

\[
x_{RMSE} = \sqrt{\frac{1}{N}\sum_{j=0}^{N-1} x_j^2}. \tag{2.12}
\]

In Figure 2.7 we see the RMSE of the signal in Figure 2.6, with a frame length of 2048 samples and a hop length of 512 samples. It is evident that there is a spike in energy when the faulty signal begins. Utilizing a priori knowledge that the first second of the recording is normal operation, we can easily see that the system is out of its normal state of operation.

However, using the signal energy alone to determine whether the system is faulty is most likely not feasible. The increase in power in Figure 2.7 could just as easily be the driver stepping on the gas pedal, increasing the Revolutions per Minute (RPM).

Figure 2.7: RMSE of faulty car engine
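A minimal sketch of this framewise RMSE computation (Equation 2.12) is shown below; it assumes a one-dimensional signal array and reuses the frame and hop lengths from Figure 2.7. Librosa offers a similar built-in feature, but this standalone version makes the calculation explicit.

```python
# Sketch: framewise RMS energy (Equation 2.12) over a sliding window,
# using the same frame length (2048) and hop length (512) as in Figure 2.7.
import numpy as np

def framewise_rmse(x, frame_length=2048, hop_length=512):
    """Return the RMS energy of each frame of the 1-D signal x."""
    n_frames = 1 + max(0, (len(x) - frame_length) // hop_length)
    rmse = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop_length : i * hop_length + frame_length]
        rmse[i] = np.sqrt(np.mean(frame ** 2))
    return rmse

# Example with a synthetic signal: quiet noise followed by a louder section.
x = np.concatenate([0.01 * np.random.randn(44100), 0.05 * np.random.randn(44100)])
print(framewise_rmse(x)[:5])
```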

2.2.2 Frequency spectrum

To do more advanced audio analysis and feature extraction, we have to move to the frequency domain. By Fourier transforming the signal, we get a representation of the signal’s frequency components. In Figure 2.8 we see the frequency representation of the signal in Figure 2.6.

However, using the frequency spectrum of the whole signal is unfavorable when it comes to audio analysis and feature extraction, as it does not show the changes within the short time-frame of interest.

Figure 2.8: FFT of faulty car engine

2.2.3 Spectrogram

The change in frequency as a function of time can be shown using a spectrogram, which can be calculated using the Short-Time Fourier Transform (STFT). The STFT divides the signal into shorter time-frames, and Fourier transforms each of them as shown in Figure 2.9. The spectrogram has a trade-off between resolution in frequency and resolution in time, as a higher frequency resolution requires longer time frames. In Figure 2.9, a frame length of 2048 samples is used with a hop length of 512 samples.
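The sketch below illustrates how such a magnitude spectrogram can be computed directly with NumPy, assuming a mono signal x; the frame length (2048) and hop length (512) mirror those used in Figure 2.9, while the Hann window is an assumption of this example.

```python
# Sketch: magnitude spectrogram (in dB) via the STFT, with a 2048-sample
# frame and 512-sample hop as in Figure 2.9.
import numpy as np

def stft_spectrogram(x, frame_length=2048, hop_length=512):
    """Return a (frequency bins x frames) magnitude spectrogram in dB."""
    window = np.hanning(frame_length)
    frames = []
    for start in range(0, len(x) - frame_length + 1, hop_length):
        frame = x[start:start + frame_length] * window
        spectrum = np.fft.rfft(frame)            # one-sided spectrum of this frame
        frames.append(np.abs(spectrum))
    magnitude = np.array(frames).T
    return 20 * np.log10(magnitude + 1e-12)      # dB, re. 1 (as in section 2.1.2)

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                  # 1 s test tone
print(stft_spectrogram(x).shape)
```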

The spectrogram is the basis of most of the more complicated feature extraction methods used in audio classification. The Mel-Frequency Cepstral Coefficients (MFCCs) are one of these and have predominantly been used in speaker and speech recognition.

Figure 2.9: Spectrogram of faulty car engine

2.2.4 Linear Frequency Cepstral Coefficients (LFCCs)

Instead of using the more conventional Mel frequency scale, the cepstrum coefficients are calculated using a linear scale. This is done to avoid imposing restrictions on the higher frequency components, as we have no a priori knowledge of what frequencies might contain the part of the signal that is of interest. This does, however, leave the feature less robust to white noise due to the higher number of filter banks in the high-frequency region [3]. Calculating the cepstral frequency coefficients is done as follows [4]:

1. Fourier transform a section of the signal.

2. Apply a triangular filter bank to the power of the spectrum, sum energy in each filter.

3. Take the logs of the powers at each of the frequencies.

4. Take the Discrete Cosine Transform (DCT) of the log powers.
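The following sketch walks through the four steps above for a precomputed power spectrogram. The filter-bank construction and the parameter values (40 filters, 2048-point FFT) are illustrative assumptions, not the exact implementation used in this thesis.

```python
# Sketch of the four LFCC steps: linearly spaced triangular filter bank applied
# to a power spectrogram, log of the filter energies, then DCT.
import numpy as np
from scipy.fftpack import dct

def linear_filter_bank(n_filters, n_fft, fs):
    """Triangular filters spaced linearly from 0 Hz to fs/2."""
    edges = np.linspace(0, fs / 2, n_filters + 2)                 # filter edge frequencies
    bins = np.floor((n_fft // 2 + 1) * edges / (fs / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fb[i, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)   # rising edge
        if hi > mid:
            fb[i, mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)   # falling edge
    return fb

def lfcc(power_spectrogram, fs, n_fft=2048, n_filters=40, n_coeffs=40):
    """power_spectrogram: (n_fft//2 + 1) x n_frames array of |FFT|^2 values."""
    fb = linear_filter_bank(n_filters, n_fft, fs)
    filter_energies = fb @ power_spectrogram                      # step 2
    log_energies = np.log(filter_energies + 1e-12)                # step 3
    return dct(log_energies, type=2, axis=0, norm="ortho")[:n_coeffs]  # step 4

# Example: power spectrogram of 1 s of noise at fs = 44100 Hz.
fs = 44100
x = np.random.randn(fs)
frames = np.lib.stride_tricks.sliding_window_view(x, 2048)[::512] * np.hanning(2048)
S = np.abs(np.fft.rfft(frames, axis=1).T) ** 2
print(lfcc(S, fs).shape)
```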

The LFCCs are the amplitudes of the resulting spectrum. The only difference between LFCCs and MFCCs is the triangular filter bank used in step two above, shown in Figure 2.10. To understand why we have decided to use a linear scale, a comparison between the linear and Mel frequency scales is needed. There is no single mel-scale formula; the one used in Figure 2.10 is the popular formula given by O'Shaughnessy in [5] as

\[
m = 2595\log_{10}\!\left(1 + \frac{f}{700}\right), \tag{2.13}
\]

where f is the frequency in Hz. It is, however, not the actual scale used that is significant, but the fact that the Mel scale is to be understood as a linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz [4].

Figure 2.10: Linear vs Mel scale. (Panels: Mel scale vs Hertz scale; Mel filter bank; Linear filter bank.)

The LFCC ∆ and ∆² values seen in Figure 2.12 are local estimates of the first and second order derivatives of the LFCCs, which are often used in combination with the cepstral coefficients to improve performance. To calculate the delta coefficients, the following formula is used:

\[
d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2}, \tag{2.14}
\]

where d_t is the delta coefficient for frame t, c_t is the cepstral coefficient for frame t, and N is the number of samples used in the local derivative estimates.
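A small sketch of Equation 2.14 is given below, assuming a coefficient matrix with one column per frame and edge frames padded by repetition; N = 2 is a common, but here arbitrary, choice.

```python
# Sketch: delta coefficients (Equation 2.14) from a coefficient matrix of shape
# (n_coefficients, n_frames), with edge frames padded by repetition.
import numpy as np

def delta(coeffs, N=2):
    padded = np.pad(coeffs, ((0, 0), (N, N)), mode="edge")
    denom = 2 * sum(n ** 2 for n in range(1, N + 1))
    d = np.zeros_like(coeffs, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[:, N + n:padded.shape[1] - N + n] -
                  padded[:, N - n:padded.shape[1] - N - n])
    return d / denom

lfccs = np.random.randn(40, 100)        # e.g. 40 coefficients x 100 frames
print(delta(lfccs).shape)               # (40, 100)
```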

When using the cepstral coefficients for a machine learning purpose, the higher coefficients are usually discarded, along with the first coefficient.

The first coefficient is the sum of all the log-energies, and can be interpreted as a measure of signal loudness. Discarding this coefficient will make the system more robust to loudness variations - a property often desired in audio classifying applications. The higher coefficients are often discarded since we are usually more interested in the general shape of the spectrum represented by the lower coefficients than the noisy fine details contained in the higher coefficients.

Figure 2.11: LFCCs of faulty car engine

Figure 2.12: LFCC-∆s of faulty car engine


2.3 Beamforming

2.3.1 Arrays

Array signal processing serves several purposes: it can be used to enhance the Signal-to-Noise Ratio (SNR) compared to a single sensor, to estimate the number of sources, their positions and their waveforms, and to track moving sources in space, such as an airplane or a drone.

Spatial Sampling

The difference between continuous apertures and arrays is the fact that arrays consist of individual elements that sample the signals spatially.

Each of these sensors has an aperture of its own, but in this thesis all the microphones are assumed to be isotropic to simplify the equations.

From classical digital signal processing, the Nyquist-Shannon sampling theorem [6] states that we have to sample an analog signal with a sampling interval of ∆t < 1/(2 f_max), or equivalently f_s > 2 f_max (the Nyquist rate), to be able to recreate the signal. Sampling at a lower frequency than the Nyquist rate leads to aliasing [7], and the signal will be distorted. The same principle applies to array signal processing if the arrays are regularly spaced. The arrays need a spatial sampling interval (given by the distance between elements, analogous to the sampling interval in time when sampling an analog signal) of

\[
d < \frac{\lambda}{2}, \tag{2.15}
\]

where d is the distance between the microphone elements. One manifestation of undersampling with an array is the occurrence of so-called grating lobes, which are the spatial manifestation of aliasing. These spatial “mainlobes” might appear within the aperture smoothing function in Figure 2.13 and lead to ambiguous directional responses. If we have grating lobes within the visible region, signals propagating from directions corresponding to the locations of the lobes will be indistinguishable from the signals arriving in the direction of the true mainlobe. Another way to mitigate the effect of spatial aliasing is by using irregular microphone spacing [8], but we will not go further into the details of this, as the arrays used in this thesis have already been designed.

The discrete aperture smoothing function, or “beampattern” as it is also called, is given as the Discrete Fourier Transform (DFT) of the sensor weights for uniformly sampled arrays:

\[
W(\mathbf{k}) = \sum_{m=0}^{M-1} w_m e^{j\mathbf{k}\cdot\mathbf{x}_m}. \tag{2.16}
\]

An example aperture smoothing function for a 10 element ULA is shown in Figure 2.13.

Figure 2.13: Aperture smoothing function for a ULA with 10 sensors, with d = λ/2. The area between the two dashed lines at ±2π/λ is called the visible region and is drawn for a critically sampled array where d = λ/2.
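As an illustrative sketch of Equation 2.16, the aperture smoothing function of a uniformly weighted 10-element ULA can be evaluated numerically as follows; the spacing d = λ/2 and the wavenumber grid are assumptions of the example.

```python
# Sketch: aperture smoothing function W(k) (Equation 2.16) for a uniformly
# weighted 10-element ULA with spacing d = lambda/2, evaluated on a grid of
# wavenumbers k_x. Grating lobes appear outside the visible region.
import numpy as np

M = 10                                       # number of sensors
wavelength = 1.0
d = wavelength / 2                           # element spacing
x_m = (np.arange(M) - (M - 1) / 2) * d       # element positions on the x-axis
w = np.ones(M) / M                           # uniform weights

kx = np.linspace(-4 * np.pi / d, 4 * np.pi / d, 2001)
W = np.array([np.sum(w * np.exp(1j * k * x_m)) for k in kx])
W_dB = 20 * np.log10(np.abs(W) + 1e-12)

print(W_dB.max())                            # 0 dB at the mainlobe (and grating lobes)
```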

2.3.2 Delay and Sum (DAS)

The most straightforward beamformer is the delay and sum beamformer.

The concept is to delay the sensor outputs by an appropriate amount of time to reinforce a signal coming from a particular direction and then sum all the channels constructively. If each channel output at time t, given by x_m(t), is weighted with w_m and then time-delayed by a factor ∆_m, the beamformed output becomes

\[
y(t) = \sum_{m=0}^{M-1} w_m\, x_m(t - \Delta_m), \tag{2.17}
\]

where ∆_m is the time delay for channel m corresponding to the array's steering direction.

For the far-field case described in subsection 2.1.3, we have a plane waveform s(t) propagating across our array of M sensors in a direction ξ. The recorded wavefield in a noise-free environment is then given as

\[
x_m(t) = s(t - \boldsymbol{\alpha}\cdot\mathbf{x}_m), \tag{2.18}
\]

where α = ξ/c is the slowness vector. After applying a delay ∆_m at the m-th sensor, weighting, and summing, we get

\[
y(t) = \sum_{m=0}^{M-1} w_m\, s(t - \Delta_m - \boldsymbol{\alpha}\cdot\mathbf{x}_m), \tag{2.19}
\]

and choosing ∆_m = −α·x_m gives us the beamformed output

\[
y(t) = \sum_{m=0}^{M-1} w_m\, s(t). \tag{2.20}
\]

Figure 2.14: Delay, weight and sum beamformer.

For the spherical (near-field) case, we have to modify how the delays at the sensors are calculated. We can no longer steer towards a propagation direction; instead, the beam has to be steered towards a spatial location.

From section 2.1 we know that the expression for a spherical wave is s(r, t) = s(t − r/c)/r; it depends on time and distance to the source, not propagation direction. The recorded sound wave on each sensor then becomes

\[
x_m(t) = \frac{s(t - r_m/c)}{r_m}, \tag{2.21}
\]

where r_m is the distance from the source to sensor m.

If we choose our delay to be ∆_m = (r − r_m)/c, where r is the distance from the source to the origin of the array, then our delay and sum output becomes

\[
y(t) = \sum_{m=0}^{M-1} w_m \frac{s(t - r_m/c - [r - r_m]/c)}{r_m}
= s(t - r/c)\sum_{m=0}^{M-1} \frac{w_m}{r_m}
= \frac{1}{r}\, s(t - r/c)\sum_{m=0}^{M-1} w_m \frac{r}{r_m}. \tag{2.22}
\]

From this we see that (1/r) s(t − r/c) is the spherically spreading wave received at the origin of the array, while the remaining sum over w_m r/r_m is the weighted sum of the sensor weights.

If we have a monochromatic source, the delays can be implemented as phase shifts instead of time delays:

\[
x_m(t) = e^{j(\omega t - \mathbf{k}\cdot\mathbf{x}_m)}, \qquad
x_m(t - \Delta_m) = x_m(t)\, e^{-j\omega\Delta_m}. \tag{2.23}
\]

For a uniformly weighted array, we can then define

\[
\mathbf{w} = \frac{\mathbf{e}}{M}, \tag{2.24}
\]

where

\[
\mathbf{e} = \begin{bmatrix} e^{j\omega\Delta_0} \\ e^{j\omega\Delta_1} \\ \vdots \\ e^{j\omega\Delta_{M-1}} \end{bmatrix} \tag{2.25}
\]

is called the steering vector, which contains the phase delays to steer towards a given direction. By using this steering vector we get the following equation for the beamformer output:

\[
y(t) = \mathbf{w}^H \mathbf{X}(t) = \frac{1}{M}\mathbf{e}^H \mathbf{X}(t), \tag{2.26}
\]

where

\[
\mathbf{X}(t) = \begin{bmatrix} x_0(t) \\ x_1(t) \\ \vdots \\ x_{M-1}(t) \end{bmatrix}. \tag{2.27}
\]
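A minimal sketch of the phase-shift (narrowband, far-field) DAS beamformer in Equations 2.23 to 2.26 is shown below; the ULA geometry, frequency, and steering angle are illustrative assumptions.

```python
# Sketch: narrowband phase-shift delay-and-sum beamformer (Equations 2.23-2.26)
# for a ULA, steering towards a plane wave arriving from angle theta (far-field).
import numpy as np

def das_weights(positions, theta, f, c=340.0):
    """Uniform phase-shift DAS weights w = e / M for a ULA (far-field case)."""
    omega = 2 * np.pi * f
    delays = -positions * np.sin(theta) / c     # Delta_m = -alpha . x_m
    e = np.exp(1j * omega * delays)             # steering vector (Equation 2.25)
    return e / len(positions)

f, c = 1000.0, 340.0
positions = np.arange(10) * (c / f) / 2         # 10-element ULA, d = lambda/2
theta = np.deg2rad(20)
w = das_weights(positions, theta, f, c)

# One narrowband snapshot of a plane wave arriving from 20 degrees (t = 0).
alpha_dot_x = positions * np.sin(theta) / c
X = np.exp(1j * (2 * np.pi * f * 0.0 - 2 * np.pi * f * alpha_dot_x))
y = np.conj(w) @ X                              # y = w^H X (Equation 2.26)
print(abs(y))                                   # ~1.0: the SOI sums coherently
```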

2.3.3 Minimum Variance Distortionless Response (MVDR)

In subsection 2.3.2 we saw that we could represent a phase-shift beamformer in vector form as

\[
y(t) = \mathbf{w}^H \mathbf{X}(t). \tag{2.28}
\]

One of the most popular adaptive beamformers is the minimum variance distortionless response beamformer, first proposed by Capon [9]. This beamformer adaptively tries to reduce the output power of the beamformer while keeping unit gain in the steering direction. The resulting weights can be interpreted as a sharp spatial bandpass filter which tries to mitigate the limitations of the delay and sum beamformer, for instance its limited ability to differentiate two closely spaced sources.

Calculating the output power of the beamformed signal y(t) is done as follows:

\[
P(y(t)) = E\{|y(t)|^2\} = E\{(\mathbf{w}^H\mathbf{X})(\mathbf{w}^H\mathbf{X})^H\}
= E\{\mathbf{w}^H\mathbf{X}\mathbf{X}^H\mathbf{w}\}
= \mathbf{w}^H E\{\mathbf{X}\mathbf{X}^H\}\mathbf{w} = \mathbf{w}^H \mathbf{R}\mathbf{w}, \tag{2.29}
\]

where R is called the spatial correlation matrix. Equation 2.29, together with the constraint of unit gain in the steering direction, leads to the minimization problem

\[
\min_{\mathbf{w}}\ \mathbf{w}^H \mathbf{R}\mathbf{w}, \qquad \mathbf{w}^H\mathbf{e} = 1. \tag{2.30}
\]

Solving this using Lagrange multipliers [10] gives the optimum MVDR weights

\[
\mathbf{w} = \frac{\mathbf{R}^{-1}\mathbf{e}}{\mathbf{e}^H\mathbf{R}^{-1}\mathbf{e}}. \tag{2.31}
\]

The effect of the adaptive weighting in removing interference can be seen in Figure 2.15, where nulls are placed in the beampattern in the directions the interference arrives from.

Figure 2.15: MVDR vs DAS weights for a 10 element ULA with d = λ/2 spacing, and signals arriving from 15°, 12° and 35° with equal amplitudes. (Panels: MVDR and DAS beampatterns steered to 0° and to 15°.)

Diagonal Loading

Calculating the spatial correlation matrix R assumes that we know the exact signal properties, i.e., an infinite observation time. It is clear that this is not viable in practice. A natural estimate of R is the sample covariance matrix

\[
\hat{\mathbf{R}} = \frac{1}{M}\sum_{k=0}^{M-1}\mathbf{x}[k]\,\mathbf{x}^H[k]. \tag{2.32}
\]

Since calculating the MVDR beamformer weights requires a matrix inversion, this method can be numerically unstable. Inadequate estimation of the correlation matrix may also result in beampatterns with high sidelobes and distorted main beams. We can compensate for this instability issue by adding an adaptively scaled identity matrix to the correlation matrix, also called diagonal loading, before performing the matrix inversion [11]:

\[
\hat{\mathbf{R}}_{dl}^{-1} = \left(\hat{\mathbf{R}} + \frac{\delta}{M}\operatorname{Tr}(\hat{\mathbf{R}})\,\mathbf{I}\right)^{-1}, \tag{2.33}
\]

where the subscript dl stands for diagonal loading and δ is a user-defined scaling factor. This scaling factor cannot be too large, however, as choosing a large δ makes the weights converge towards the DAS weights.

This loss of adaptivity when diagonal loading is applied arises because it makes the correlation matrix more similar to a scaled identity matrix:

\[
\lim_{\delta\to\infty}\hat{\mathbf{R}}_{dl} = \delta\mathbf{I}. \tag{2.34}
\]

If we insert this into the weight calculation in Equation 2.31, we get

\[
\mathbf{w} = \frac{(\delta\mathbf{I})^{-1}\mathbf{e}}{\mathbf{e}^H(\delta\mathbf{I})^{-1}\mathbf{e}}
= \frac{\mathbf{e}}{\mathbf{e}^H\mathbf{e}} = \frac{\mathbf{e}}{M}, \tag{2.35}
\]

which is the same set of weights we had for the phase-shift DAS beamformer.
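A minimal sketch of the narrowband MVDR weight calculation with diagonal loading (Equations 2.31 to 2.33) is given below, assuming a matrix of complex array snapshots with one column per snapshot; the scenario in the usage example (source at broadside, interferer at 30°) is purely illustrative.

```python
# Sketch: narrowband MVDR weights with diagonal loading (Equations 2.31-2.33).
# X: (M x K) matrix of complex array snapshots, e: (M,) steering vector.
import numpy as np

def mvdr_weights(X, e, delta=0.01):
    M, K = X.shape
    R_hat = (X @ X.conj().T) / K                          # sample correlation matrix
    R_dl = R_hat + (delta / M) * np.trace(R_hat).real * np.eye(M)  # diagonal loading
    R_inv = np.linalg.inv(R_dl)
    return (R_inv @ e) / (e.conj() @ R_inv @ e)           # Equation 2.31

# Example: 10-element ULA, desired source at 0 deg, strong interferer at 30 deg.
M, K, f, c = 10, 200, 1000.0, 340.0
pos = np.arange(M) * (c / f) / 2
steer = lambda th: np.exp(-1j * 2 * np.pi * f / c * pos * np.sin(th))
rng = np.random.default_rng(0)
X = (np.outer(steer(0.0), rng.standard_normal(K)) +
     5 * np.outer(steer(np.deg2rad(30)), rng.standard_normal(K)) +
     0.1 * (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))))
w = mvdr_weights(X, steer(0.0))
print(abs(w.conj() @ steer(0.0)))              # ~1: distortionless in the look direction
print(abs(w.conj() @ steer(np.deg2rad(30))))   # small: the interferer is suppressed
```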

Broadband MVDR

The broadband (subband) MVDR can be implemented by applying the DFT to each of the microphone channels. To do this, we first have to buffer N samples, which results in N subbands. The narrowband MVDR weights are then applied to the center frequencies of each subband, summed to a broadband result and converted back to a time-domain signal using the Inverse Discrete Fourier Transform (IDFT). Figure 2.16 shows the block diagram for the broadband MVDR beamformer.

Figure 2.16: Block diagram for broadband DFT MVDR beamforming

Implementing the delays through phase shifts does lead to some unwanted effects if the number of samples N in the DFT is too low. Having N subbands means that the spacing between each of the Fourier coefficients will be given by f_s/N, where f_s is the sampling frequency. If we have a sampling frequency of f_s = 44100 Hz and perform a 1024-point DFT, we get a sampling interval in the frequency domain of ∆f = 44100/1024 ≈ 43 Hz. We know from subsection 2.3.2 that the phase shift for a single frequency is given by the steering vector

Figure 2.16: Block diagram for broadband DFT MVDR beamforming Implementing the delays through phase-shift does lead to some unwanted effects if the number of samples N in the DFT is too low. Having N subbands means that the spacing between each of the Fourier coefficients will be given by Nfs, where f s is the sampling frequency. If we have a sampling frequency of fs = 44100 and perform a 1024-point DFT, we get a sampling interval in the frequency-domain of∆f = 44100102440 Hz. We know from subsection 2.3.2 that the phase shift for a single frequency is given by the steering vector

e=

 e0 ejω∆1

...

eM1

. (2.36)

This means that our worst-case scenario is a frequency mismatch of f_s/(2N) between the steering vector e and the data contained in the corresponding frequency bin m, which essentially is the same as having a gradually increasing steering mismatch for all frequencies in between two frequency bins. This is especially true for the lower frequencies. For example, a 100 Hz signal being phase-shifted by φ = 2πf∆ with f = 120 Hz would lead to a 20% error in the phase shift. With this in mind, we have to make sure to use an adequate number of samples when calculating the DFT.

There are two constraints on the number of samples we can use. First of all, having a high number of samples means calculating a large number of weights, as we need M weights for each of the frequency bins, which again leads to a higher computational complexity for the algorithm. The other limitation is the number of samples we can buffer before performing the DFT. If our sound sources are stationary, this number can be as large as we want, and we might even save on computational complexity by not updating the weights for every iteration. In a more dynamic environment the weights have to be updated more frequently, which means we have to buffer a lower number of samples before updating the weights. In this case there is a trade-off between the buffer size and the adaptability of the beamformer.
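The sketch below illustrates the subband structure of Figure 2.16: each channel is buffered into N-sample blocks, beamformed per frequency bin with narrowband MVDR weights, and transformed back to the time domain. The block length, the diagonal loading factor, and the far-field steering vectors are assumptions of this example, and no overlap between blocks is used.

```python
# Sketch: broadband DFT MVDR (Figure 2.16). Each channel is buffered into
# N-sample blocks, transformed with an FFT, beamformed per frequency bin with
# narrowband MVDR weights, and transformed back to the time domain.
import numpy as np

def broadband_mvdr(x, positions, theta, fs, N=1024, c=340.0, delta=0.01):
    """x: (M, samples) multichannel signal; returns the beamformed time signal."""
    M, n_samples = x.shape
    n_blocks = n_samples // N
    blocks = x[:, :n_blocks * N].reshape(M, n_blocks, N)
    X = np.fft.rfft(blocks, axis=2)                      # (M, n_blocks, N//2 + 1) subbands
    freqs = np.fft.rfftfreq(N, d=1 / fs)
    Y = np.zeros((n_blocks, X.shape[2]), dtype=complex)
    for b, f in enumerate(freqs):
        Xb = X[:, :, b]                                  # snapshots for this subband
        e = np.exp(-1j * 2 * np.pi * f / c * positions * np.sin(theta))
        R = Xb @ Xb.conj().T / max(n_blocks, 1)
        R += (delta / M) * np.trace(R).real * np.eye(M)  # diagonal loading
        R_inv = np.linalg.inv(R)
        w = (R_inv @ e) / (e.conj() @ R_inv @ e)         # narrowband MVDR weights
        Y[:, b] = w.conj() @ Xb                          # y = w^H X per subband
    return np.fft.irfft(Y, n=N, axis=1).reshape(-1)      # back to the time domain

# Toy usage: 8-channel recording of white noise, steering towards broadside.
rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8192))
y = broadband_mvdr(x, positions=np.arange(8) * 0.02, theta=0.0, fs=44100)
print(y.shape)
```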

2.3.4 Generalized Sidelobe Canceller (GSC)

In subsection 2.3.3 we derived the MVDR beamformer by imposing the linear constraint

\[
\mathbf{w}^H\mathbf{e} = 1. \tag{2.37}
\]

The Linearly Constrained Minimum Variance (LCMV) beamformer is an expansion of the MVDR beamformer, which allows for more than one constraint to be imposed on the system. The new constraints are defined as

\[
\mathbf{w}^H\mathbf{C} = \mathbf{g}^H, \tag{2.38}
\]

where C is the constraint matrix, which has linearly independent columns, and g contains the constraint values. Adding a distortionless constraint to the LCMV beamformer would mean that one of the columns in C, c_i, has to be c_i = e and the corresponding i-th element in g has to be g_i = 1. Finding the optimum weights by minimizing w^H R w with these constraints can be done in the same fashion as for the MVDR beamformer by using Lagrange multipliers [12]. This leads to

\[
\mathbf{w}_{lcmv}^H = \mathbf{g}^H\left[\mathbf{C}^H\mathbf{R}_n^{-1}\mathbf{C}\right]^{-1}\mathbf{C}^H\mathbf{R}_n^{-1}. \tag{2.39}
\]

One possible implementation of the broadband LCMV beamformer is the Generalized Sidelobe Canceller. The GSC closely resembles the Frost beamformer [13] but is an unconstrained version of the minimum variance beamformer. It divides the N-dimensional space into two subspaces, a constraint subspace and an orthogonal subspace, by using the blocking matrix B shown in Figure 2.17. This matrix filters out the signal from the steering direction and results in the (M − C) × N matrix containing the orthogonal subspace, where C is the number of constraints imposed on the system [14]. The upper, non-adaptive beamformer is the constraint subspace.

Dividing the signal space into two subspaces means we can write our GSC weights as

\[
\mathbf{w}_{gsc}^H = \mathbf{w}_c^H - \mathbf{w}_p^H, \tag{2.40}
\]

where w_c is the projection of w_gsc onto the constraint subspace, and w_p^H is the projection of w_gsc^H onto B. The projection onto the constraint subspace is given by [12]

\[
\mathbf{P}_C = \mathbf{C}\left[\mathbf{C}^H\mathbf{C}\right]^{-1}\mathbf{C}^H, \tag{2.41}
\]

which leads to

\[
\begin{aligned}
\mathbf{w}_c^H &= \mathbf{w}_{gsc}^H\mathbf{P}_C && (2.42)\\
&= \mathbf{g}^H\left[\mathbf{C}^H\mathbf{C}\right]^{-1}\mathbf{C}^H. && (2.43)
\end{aligned}
\]

w_p^H can be written as w_a^H B^H, where w_a is given by [15]

\[
\mathbf{w}_a^H = \mathbf{w}_c^H\mathbf{R}_n\mathbf{B}\left[\mathbf{B}^H\mathbf{R}_n\mathbf{B}\right]^{-1}. \tag{2.44}
\]

Figure 2.17: Block diagram of the GSC

There are many variations of the GSC, but the one used in this thesis is the beamformer described by Griffiths and Jim [15], shown in Figure 2.18.

In this beamformer, the upper path consists of a DAS beamformer, using uniform weights in our case. This upper path ensures that the signal in the steering direction passes through the system without distortion. The lower, adaptive path tries to block out the signal of interest while letting the noise and interference through by using the blocking matrix B. It then adaptively creates a noise estimate y_a to be subtracted from the upper path.

We know from subsection 2.3.2 that the output from the upper path is

\[
y_c^0[k] = \mathbf{w}_c^H\mathbf{X}[k], \tag{2.45}
\]

where X[k] is the sampled and time-delayed version of the input signals of the form X[k] = [x_1[k], x_2[k], ..., x_M[k]]^T. To simplify notation, the coefficients in w_c are assumed to be normalized to sum to unity, e.g., [1/M, 1/M, ..., 1/M]^T for uniform weighting. This is then filtered through a Finite Impulse Response (FIR) filter f[l] containing the constraint values. One common constraint is that of zero distortion; in that case, the FIR filter is simply f[l] = δ[l], and the expression for the upper path becomes

\[
y_c[k] = \sum_{l=-K}^{K} f[l]\, y_c^0[k-l]. \tag{2.46}
\]

Figure 2.18: Diagram of LMS based GSC, adapted from [15]

In the adaptive path, the presteered signal first passes through the blocking matrix B, removing as much as possible of the signal from the steering direction. Since the desired signal is common to each of the steered outputs, blocking of the signal is ensured as long as the rows of B sum to zero.

Correct steering is however critical for this step, as incorrect steering leads to the SOI leaking through the blocking matrix.

We use X_0[k] to denote the output of B, such that

\[
\mathbf{X}_0[k] = \mathbf{B}\mathbf{X}[k]. \tag{2.47}
\]

The adaptive filters, denoted W_a, consist of L = 2K + 1 FIR filters of length (M − C), the same number as the length of the FIR filter in the upper path. If we let W_{a,l} denote the l-th filter of the adaptive weight matrix, the output of the sidelobe cancelling path is

\[
y_a[k] = \sum_{l=-K}^{K} \mathbf{W}_{a,l}^T[k]\,\mathbf{X}_0[k-l]. \tag{2.48}
\]

The output of the entire beamformer is then given as

\[
y[k] = y_c[k] - y_a[k]. \tag{2.49}
\]
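To make the structure of Figure 2.18 concrete, the sketch below implements a Griffiths-Jim style GSC with a single-tap NLMS update per blocked channel (K = 0). The pairwise-difference blocking matrix and the step size are illustrative assumptions; the actual blocking matrix and filter lengths used in this thesis may differ.

```python
# Sketch: Griffiths-Jim GSC with a single-tap NLMS adaptive filter per blocked
# channel (K = 0). The channels in x are assumed to be presteered, so the SOI
# is identical on every channel; the blocking matrix takes pairwise differences.
import numpy as np

def gsc_nlms(x, alpha=0.1, eps=1e-8):
    """x: (M, n_samples) presteered channels; returns the beamformed signal y[k]."""
    M, n_samples = x.shape
    B = np.zeros((M - 1, M))
    B[np.arange(M - 1), np.arange(M - 1)] = 1.0        # rows of B sum to zero,
    B[np.arange(M - 1), np.arange(M - 1) + 1] = -1.0   # blocking the steered SOI
    w_a = np.zeros(M - 1)                              # adaptive weights
    y = np.zeros(n_samples)
    for k in range(n_samples):
        y_c = np.mean(x[:, k])                         # fixed DAS path, w_c = 1/M
        x_blocked = B @ x[:, k]                        # noise reference X_0[k]
        y_a = w_a @ x_blocked                          # adaptive noise estimate
        y[k] = y_c - y_a                               # Equation 2.49
        norm = x_blocked @ x_blocked + eps
        w_a += alpha * y[k] * x_blocked / norm         # NLMS weight update
    return y

# Toy usage: SOI common to all channels plus uncorrelated noise per channel.
rng = np.random.default_rng(2)
soi = np.sin(2 * np.pi * 200 * np.arange(4096) / 44100)
x = soi + 0.5 * rng.standard_normal((6, 4096))
print(gsc_nlms(x)[:5])
```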
