The use of Java in different signal processing applications including a text independent speaker recognizer based on cepstral coefficients mathematical modeling


Department of informatics


Kjetil Pedersen

November 2000


The results in this thesis show that text-independent speaker recognition in a speech utterance composed of several speakers, based on a coarse average of the cepstrum coefficients, will not give a satisfactory result. I find, though, that it could be used with success in simpler speaker verification applications.


This thesis is written as a required part of the Candidatus Scientiarum (Master of Science) degree in informatics at the Department of Informatics, University of Oslo, Norway. The work was started in January 1999 and ended in November 2000.

The work is a practical and theoretical approach to studying the use of the programming language Java in medical ultrasound, in streaming applications for the Internet, and in speech processing. Due to some problems that will be described more thoroughly in the Introduction chapter, the speech processing part is the main part of the thesis. Some background theory is included to motivate the reader and to introduce the notation required in the proposed methods. This will hopefully give a wider understanding of the methods and the conditions they apply to.

Finally, I would like to thank my supervisor, Professor Sverre Holm, for his encouragement and assistance throughout this work. I would also like to thank all my fellow students, and Helge Fjellestad in particular, for helping me with technical as well as theoretical problems during this work, and my family, neighbors and friends for helping me read the proofs and for participating in the process of collecting the speech data.


Contents

1 Introduction
  1.1 The objective of this thesis
  1.2 Java
2 Medical Ultrasound Imaging
  2.1 Main Principles
  2.2 A, B and M modes of ultrasound
    2.2.1 A-mode
    2.2.2 B-mode
    2.2.3 M-mode
    2.2.4 Near-field and far-field
3 Networking
  3.1 Motivation
  3.2 UDP, RTP and RTCP
    3.2.1 Protocols used for streaming Media
    3.2.2 RTP Services
  3.3 Compression
    3.3.1 Lossless compression algorithms
    3.3.2 Lossy compression algorithms
    3.3.3 The Java Media Framework
    3.3.4 The Java Native Interface
4 Low-rate Speech Coding
  4.1 Notation
  4.2 Basic Speech Properties and Historical Overview
  4.3 Different Types of Vocoders
  4.4 How to Measure the Performance
  4.5 Physical Aspects of Speech Modeling
  4.6 AutoRegressive Moving Average Models
    4.6.2 Autoregressive Processes
  4.7 The WAV file format
  4.8 Linear Predictive Coding and Transmission of LP-parameters
    4.8.1 Linear Predictive Coding (LPC)
    4.8.2 Log Area Ratios (LARs)
    4.8.3 Line Spectral Pairs (LSPs)
    4.8.4 Cepstral coefficients
  4.9 CELP coders
    4.9.1 The conventional CELP coder
    4.9.2 Main Differences Between the Original CELP and the MPEG4-CELP
5 Method and implementation
  5.1 The software used in encoding the files
  5.2 Experiment setup
  5.3 Computation of the regressive cepstrum conversion
  5.4 Segmenting the speech sample when more than one speaker is present
  5.5 How the tests were carried out
6 Results and Discussion
  6.1 Text dependent speaker identification based on five equal sentences
  6.2 Text dependent speaker identification based on five different sentences
  6.3 Text Independent Speaker Recognition Based on Only Different Sentences
  6.4 Experiences with the use of Java in different signal processing applications
7 Conclusion and further work
A Selected parts of the Java source code
  A.1 Source code for reading WAV data
  A.2 Source code for segmentation of speech
  A.3 Source code for various filters

List of Figures

2.1 Picture of a LOGIQ 700 MR Transducer
2.2 Panoramic TEE (transesophageal echocardiography, an imaging system) of enlarged right atrium.
2.3 An illustration of incoming curved waves (near-field) to the left, and plane waves (far-field) to the right.
3.1 The Internet architecture.
3.2 The RTP architecture.
3.3 The three main steps in the JPEG encoding.
3.4 Zigzag encoding used in JPEG.
3.5 Sequence of I, P, and B frames generated by MPEG.
3.6 Each frame as a collection of macroblocks.
4.1 The periodicity obtained when pronouncing the voiced sound /a/.
4.2 The random-like time domain plot of the pronunciation of the unvoiced sound /sh/.
4.3 The harmonic frequency domain plot of the voiced sound /a/.
4.4 The broad band spectrum plot of the pronunciation of the unvoiced sound /sh/.
4.5 The time domain plot of the plosive sound /p/.
4.6 The frequency domain plot of the plosive sound /p/.
4.7 The engineering model for speech synthesis.
4.8 A simplified vocal tract.
4.9 Cepstral plot of a male speaker pronouncing the vowel /a/.
4.10 Cepstral plot of a male speaker pronouncing the unvoiced fricative /s/.
4.11 Synthesis section of a simple vocoder.
4.12 CELP encoder. The capsulated region equals the CELP decoder.
4.13 Cells for two-dimensional Vector Quantization (VQ).
5.1 [...] the noise, and the beginning and ending of the utterance.
5.2 Energy plot for the word "multiply" with markers indicating the noise, and the beginning and ending of the utterance.
5.3 Energy plot of two different speakers. Notice the difference in energy. The first speaker is a male, the second a female.
5.4 Flowchart illustration for the beginning point estimation of an utterance.
5.5 Flowchart illustration for the end point estimation of an utterance.
6.1 Plot of the averages of the cepstral values for each utterance. All sentences were equal.
6.2 Plot of the averages of the cepstral values for each utterance in the second test. All speakers pronounced the same five different sentences.
6.3 Waveform for the beginning of the word "six".
6.4 Waveform for the end of the word "five".
6.5 Plot of the averages of the cepstral values for each utterance in the second test. All utterances for all speakers were different.
6.6 The SpeechAnalyzer frame

List of Tables

2.1 Sound velocity in different biological and non-biological material, found in [8].
4.1 The WAV file format.
4.2 The table illustrates the main steps in the Durbin recursion.
4.3 A summary of performance for some coders.
4.4 The MOS scale.
4.5 Fixed bitrates supported for speech sampled at 8 kHz.
4.6 Fixed bitrates supported for speech sampled at 16 kHz.
4.7 A summary of the decoder complexity levels.
6.1 A summary of the performance of the algorithm in a text dependent speaker recognition application. All the utterances were equal.
6.2 [...] dependent speaker recognition application. All the utterances were equal.
6.3 [...] dependent speaker recognition application. All the sentences were different, but all the speakers pronounced the same five [...]
6.4 [...] dependent speaker recognition application. All the sentences were different, but all the speakers pronounced the same five [...]
6.5 The results after testing the text independent speaker recognition application on 10 test samples with different training sets.


Introduction

1.1 The objective of this thesis

The objective of this thesis is to explore the possibility of using the programming language Java in a number of different signal processing applications. This is done on three slightly different tasks: an interaction between Java and C++ to process ultrasound data, a streaming application for video over the Internet, and third, as the main part, a text-independent speaker recognition application written in Java that utilizes the encoded data from the MPEG4-CELP coder, used for speech coding at low bitrates, to distinguish different speakers in an input stream. I ought to mention that there have been a couple of difficulties during the work, which is the reason why this thesis consists of three parts. I will deal with these problems below as I give a brief description of the three parts.

The first task is to develop an application where the Java program interacts with a C++ program that defines an ultrasound image format. This is a co-operation with GE Vingmed Ultrasound, and the question is whether or not Java is suitable for this purpose: can it process and update the images quickly enough? As the ultrasound image format is defined in C++, I have to utilise the Java Native Interface (JNI) API (Application Programming Interface) to integrate Java with C++. In this part I show how one can successfully use Java for the user interface part of a computationally expensive task, and make a connection to the C++ program that takes care of the heavy computations. The interconnection between Java and C++ (or Visual C++, which I used) worked fine, but I could not get the image right. It did not look like the ultrasound image it was supposed to, and I was not able to solve the error.

Further, in the second part, which is a co-operation with the Internet company [...] Java Media Framework (JMF), to exploit the possibility of streaming video over the Internet. This package is described more thoroughly in Section 3.3.3. The purpose of this part of the work was first to make an application that was able to stream well-known media formats like MPEG, WAV, MP3 and AVI files, and then to implement a codec (encoder/decoder) for FAST's own video format, the FVT (Fast Video Transfer) format. The first phase works just fine, as the JMF package is well suited for use in streaming applications, but in the second phase a problem concerning company-confidential material arose. To implement a codec for the FVT format it is, of course, necessary to get access to the source code. But, as this code is company confidential, FAST would not let it be stored on one of the university's hard drives. The solution would have been for me to do this part of the work in one of FAST's offices, but as they were already short of free office space, this was not possible. So, instead, it was decided to take another approach and look at the way that sound is compressed in the FVT format. What they use for this part of the media stream is the newly developed MPEG4 audio format, which is an open source format. This leads to the third part of this thesis.

In the last part the question is how well suited the encoded data from the MP4 encoder are for a speaker recognition application, and especially a text-independent speaker recognition application. There would be a great computational benefit if it is possible to avoid decoding the data before we analyze them. In this part I show that the data, when transformed to so-called cepstral coefficients, can be used in certain speaker verification applications, but more sophisticated methods should be used in the text-independent recognition tasks, and to improve the results when dealing with speaker verification. With this simple method the error rate is too high to be used for practical purposes, but if the method is extended it should be possible to obtain fairly good results. These extensions are suggested in Chapter 6, Results and Further work. I start with speaker verification tests, one where all the training utterances are the same and one where they are all different, before I move on to the text-independent speaker recognition test. This is a more challenging task, as I have to segment the speech sample into speakers. The difficulties this involves are described more thoroughly in the Method section.

Even though the first two projects were not completed the way they were supposed to be, I did use quite an amount of time reading background material and programming. Therefore I decided to include some of this theory in this thesis.

1.2 Java

Java was introduced in late 1995, and it took the Internet by storm. One of the reasons why it became so popular was the ability to add fancy graphics and music to web pages. The language, developed by Sun Microsystems, is platform-independent. This means that you write your code once and can then run the program on any machine architecture. This is another, and the main, reason for its popularity.

The platform independence is at the same time one of the major drawbacks. When you compile Java code, the program is compiled into an architecture-neutral byte-code format, and not architecture-dependent machine code. To run this byte-code your machine must implement the Java virtual machine (JVM). The JVM is the interpreter and run-time system. Even though the speed of the Java interpreter has increased tremendously over the past years, it will still not run as fast as, e.g., C or C++ code. Perhaps this is just a matter of time as the compilers go through the process of development. One has to remember that C especially is an old language, and its compilers have been optimized over several years.


Medical Ultrasound Imaging

Sound is one of the most important carriers of information. Sound is pressure waves that propagate through a medium. When, for instance, a tuning fork is struck against something, it starts to vibrate and influences the molecules in the surrounding air. These molecules will then start to vibrate with the same frequency as the tuning fork, and the variations in the air pressure will propagate through the air and make the membrane of the ear vibrate [3].

Sound with a higher frequency than humans are able to hear is called ultrasound. The frequency is typically in the range 2-10 MHz for use in medical imaging. Ultrasound has been examined for decades, but it was not before the 1930s and 1940s that the two brothers Friedrich and Theodore Dussik¹ discovered its potential in medical diagnostics. What really inspired the early ultrasound investigators was SONAR (SOund Navigation And Ranging), the technique of sending sound waves through water and observing the returning echoes to characterize submerged objects. Even though it was discovered in the 1930s, the major breakthrough came in the 1970s with the B-mode presentation of two-dimensional gray-scale imaging. I will discuss B-mode imaging in a subsection below.

2.1 Main Principles

The ultrasound is generated by a probe (also called a transducer). One typical example of a probe is illustrated in Figure 2.1. It was the work done by Pierre and Jacques Curie² that in 1880 led to the modern-day ultrasound transducer. They discovered the piezoelectric effect (from the Greek piezein, to press), whereby physical pressure applied to a crystal resulted in the creation of an electric potential. The electric charge was directly proportional to the force applied to the crystal. They also found the reverse piezoelectric effect, which occurred when a rapidly changing electric potential was applied to the crystal and caused it to vibrate. The essence of today's ultrasound transducers is that they contain piezoelectric crystals that expand and contract to interconvert electric and mechanical energy.

Figure 2.1: Picture of a LOGIQ 700 MR Transducer

¹ Karl Theodore Dussik, born in Vienna, Austria, on Jan. 9, 1908, was a psychiatrist and neurologist at the University of Vienna.

² Pierre Curie, born in Paris on May 15, 1859. Married to the well-known Marie Curie.

The information about distant events is carried to the sensors in the probe by the reflection of the waves that the probe transmitted. The physics of the wave propagation is described by the wave equation, defined by

$$\frac{\partial^2 s}{\partial x^2} + \frac{\partial^2 s}{\partial y^2} + \frac{\partial^2 s}{\partial z^2} = \frac{1}{c^2}\,\frac{\partial^2 s}{\partial t^2}, \tag{2.1}$$

where $s = s(\vec{x}, t)$ is a scalar field, $c$ can be interpreted as the speed of propagation, and $\vec{x} = (x, y, z)$. This equation can be derived from Maxwell's equations, which describe electromagnetic waves [4]. It can also be solved for the problem of a vibrating string [5].

2.2 A, B and M modes of ultrasound

One example of an early ultrasound technique is the transmission method, where a receiver was placed on the opposite side of the specimen being imaged. In this technique one measured the amount of sound that was not absorbed [6]. Another method was the pulsed reflection method, in which the transmitter and receiver were placed on the same side of the specimen.

2.2.1 A-mode

A technique that produced a one-dimensional image was the Amplitude or A-mode ultrasound. What is being displayed is the amplitude along the [...] that returned to the transducer, the higher the spike.

In A-mode imaging we also talk about range resolution, determined by the length of the transmitted pulse $T_p$, which is inversely proportional to the transducer bandwidth $B_w$. This resolution then becomes [7]

$$\Delta_r = \frac{c\,T_p}{2} = \frac{c}{2 B_w}, \tag{2.2}$$

where $c$ is the sound velocity. The sound velocity depends on the material the waves propagate through, and if we let $\rho$ be the mass density and $\kappa$ the volume compressibility we get

$$c = \frac{1}{\sqrt{\rho\kappa}}. \tag{2.3}$$
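As a rough numerical illustration of equations (2.2) and (2.3), the following Java sketch evaluates both formulas. The class and method names are hypothetical, and the density, compressibility and bandwidth values are assumed, water-like example numbers that do not come from this thesis.

```java
// Minimal sketch of equations (2.2) and (2.3); all names and values are illustrative assumptions.
final class RangeResolution {

    /** Equation (2.3): c = 1 / sqrt(rho * kappa), with rho in kg/m^3 and kappa in 1/Pa. */
    static double soundVelocity(double rho, double kappa) {
        return 1.0 / Math.sqrt(rho * kappa);
    }

    /** Equation (2.2): delta_r = c / (2 * B_w), with c in m/s and B_w in Hz. */
    static double rangeResolution(double c, double bandwidthHz) {
        return c / (2.0 * bandwidthHz);
    }

    public static void main(String[] args) {
        // Assumed water-like medium: rho = 1000 kg/m^3, kappa = 4.6e-10 1/Pa.
        double c = soundVelocity(1000.0, 4.6e-10);
        // Assumed transducer bandwidth of 2 MHz.
        double deltaR = rangeResolution(c, 2.0e6);
        System.out.printf("c = %.0f m/s, range resolution = %.3f mm%n", c, deltaR * 1e3);
    }
}
```

With these assumed numbers the resolution comes out well below a millimetre, which shows how a wider bandwidth (a shorter pulse) gives a finer range resolution.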

2.2.2 B-mode

The most commonly used technique is the Brightness or B-mode. This technique produces a two-dimensional characterization of the tissue whereby each pixel on a screen represents an individual amplitude spike. These images are gray-scale images where amplitudes of varying intensity are assigned shades from black to white. Figure 2.2 shows an example of a B-mode image.

Figure 2.2: Panoramic TEE (transesophageal echocardiography, an imaging system) of enlarged right atrium.

2.2.3 M-mode

Another mode is the M-mode or Motion-mode imaging, which relates the amplitude of the ultrasound wave to the imaging of moving structures, such as cardiac muscle. The amplitude of the echo is displayed along a line in the object as a function of time. This is just like an echo sounder.

One may also take into account the Doppler effect to image, for instance, the blood velocity, and also to produce color-coded images of the blood velocity [8]. When the scatterer is moving, the frequency of the reflected signals will be altered from the transmitted frequency. It is this change in frequency that is called the Doppler effect. The Doppler shift, $f_d$, is given by

$$f_d = \frac{2 f_0 v \cos\theta}{c}, \tag{2.4}$$

where

• $f_0$: Frequency of transmitted ultrasound.

• $c$: Ultrasound wave velocity.

• $v$: Velocity of the scatterer (e.g. red cells).

• $\theta$: The angle between the velocity direction and the ultrasound beam.
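Equation (2.4) can be tried out with a few lines of Java. This is only a sketch: the class name is hypothetical, and the transmit frequency, blood velocity, beam angle and sound velocity in `main` are assumed example values, not measurements from this work.

```java
// Minimal sketch of equation (2.4); names and example values are illustrative assumptions.
final class DopplerShift {

    /**
     * Doppler shift f_d = 2 * f0 * v * cos(theta) / c.
     * f0Hz: transmitted frequency (Hz), velocity: scatterer speed (m/s),
     * angleRad: angle between velocity direction and beam (radians), c: sound velocity (m/s).
     */
    static double shiftHz(double f0Hz, double velocity, double angleRad, double c) {
        return 2.0 * f0Hz * velocity * Math.cos(angleRad) / c;
    }

    public static void main(String[] args) {
        // Assumed example: 5 MHz probe, blood moving at 0.5 m/s, 60 degree beam angle,
        // and a sound velocity of 1540 m/s.
        double fd = shiftHz(5.0e6, 0.5, Math.toRadians(60.0), 1540.0);
        System.out.printf("Doppler shift = %.1f Hz%n", fd);
    }
}
```

Note that the shift vanishes when the beam is perpendicular to the flow (cos θ = 0), which is why the beam angle matters in practice.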

An important characteristic of the medium is how fast these pressure waves travel in it. Table 2.1 gives some examples. Another important problem in ultrasound imaging is the tradeoff between frequency and penetration. As we increase the frequency, we decrease the penetration depth. This also means that the image resolution is poor for organs hidden deep in the body, because we have to increase the wavelength to reach them.

2.2.4 Near-field and far-field

Transducers consist of array elements, and the arrays can be either linear, quadratic or circular.

When working with ultrasound images, we operate in the near-field. This means that the wavefront of the propagating wave is perceptibly curved with respect to the dimension of the array. The waves will hit the array as illustrated to the left in Figure 2.3.

In the far-field case, on the other hand, we consider the incoming waves as plane waves, as illustrated to the right in Figure 2.3.

Biological material        Sound velocity (m/s)
Fat                        1440
Kidney                     1557
Muscles                    1542 - 1626
Bone                       2700 - 4100

Nonbiological material     Sound velocity (m/s)
Air                        330
Salt water                 1531
Quartz                     5750
Gold                       3240

Table 2.1: Sound velocity in different biological and non-biological material, found in [8].

Figure 2.3: An illustration of incoming curved waves (near-field) to the left, and plane waves (far-field) to the right.

Networking

In this first section I will give a brief introduction to some of the basic principles of network theory. This includes a description of how the Internet is built up, and I introduce the most important protocols, which will be referred to later in this thesis.

3.1 Motivation

What distinguishes a computer network from others (like telephone or cable networks) is its generality. Computer networks are not optimized for a particular application, like making phone calls or delivering television signals. They support many applications, and carry different types of data.

A network must provide connectivity among a set of computers. At the lowest level, a network can consist of two or more computers connected by some physical medium, such as a coaxial cable or an optical fiber. Such a physical medium is referred to as a link, and the computers are referred to as nodes.

There are many different kinds of network technologies. Examples are Ethernet, FDDI (Fiber-Distributed Data Interface, a standard for data transmission on fiber optic lines) and ATM (Asynchronous Transfer Mode, a dedicated-connection switching technology), and the Internet is an interconnection of all the different technologies. As new networks are added to the Internet all the time, the system must scale well. This means that it is designed to support growth to an arbitrary size. What supports this interconnection of multiple networking technologies into a single, logical internetwork is the Internet Protocol (IP), which is a protocol in the Internet Architecture. This architecture is a way of splitting the network services into layers of abstraction, and it evolved out of experience with an earlier packet-switched network called ARPANET¹. The Internet Architecture is illustrated in Figure 3.1.

FTP HTTP NV TFTP

TCP UDP

IP

NET 1 NET 2 ... NET N

Figure 3.1: The Internet architecture.

It decomposes the problem of building a network into more manageable components and provides a more modular design. The abstract objects that make up the layers of a network system are called protocols. The Internet Architecture is also referred to as TCP/IP, named after its two main protocols. IP defines the infrastructure that allows nodes and networks to function as a single logical network. The IP service model is connectionless, which means that we do not set up a connection; we just make sure that every packet contains enough information to get it to its destination. IP is a best-effort model, which means that it does not guarantee that the packets will be delivered, but it will do its best. What this does mean is that packets can get lost, can be delivered out of order, or can be delivered twice, and if something goes wrong, the network does nothing to recover.

3.2 UDP, RTP and RTCP

When media content is streamed to a client in real time, the client can begin to play the stream without having to wait for the complete stream to download. Sometimes it is even impossible to download the entire stream before playing it, because the stream may not have a predefined length. In addition, when we are transmitting media across the net in real time, we require high network throughput. It is easier to compensate for lost [...] is not that important that all the data arrive uncorrupted (or arrive at all). This is very different from accessing static data such as a file, where the most important thing is that all of the data arrive at its destination. This implies that the protocols used to transfer static data may not work well for streaming media.

¹ A packet-switched network: each node in the path to the other machine receives a complete packet, stores this packet in its internal memory, and then forwards the complete packet to the next node.

3.2.1 Protocols used for streaming Media

Both the HTTP and FTP protocols are based on TCP, a protocol designed for reliable data communications on low-bandwidth, high-error-rate networks. But TCP sits above IP, and I just stated that IP is unreliable. So how can TCP guarantee that the packets will be delivered to the client? This is solved by retransmission. If a packet is missing, TCP makes sure that the packet is retransmitted. This overhead of guaranteeing reliable data transfer slows the overall transmission rate. This is why other protocols like UDP (User Datagram Protocol) are typically used for streaming media. UDP is an unreliable protocol; it does not guarantee that each packet will reach its destination. Further, there is no guarantee that the packets will arrive in the order they were sent. The receiver has to be able to compensate for lost data, duplicate packets, and packets that arrive out of order.

UDP is, just like TCP, a lower-level networking protocol, and more application-specific protocols are built on top of it. One example of such a protocol is the Real-Time Transport Protocol (RTP). RTP is the Internet standard for transporting real-time data such as audio and video, and Figure 3.2 shows the RTP architecture.

Real-Time Media Frameworks and Applications

Real-Time Control Protocol (RTCP)   Real-Time Transport Protocol (RTP)

UDP   Other Network and Transport Protocols (TCP, ATM, ST-II, etc.)

IP

Figure 3.2: The RTP architecture.

As the figure illustrates, RTP is often used over UDP, but it is actually network and transport-protocol independent, and it provides end-to-end network delivery services for the transmission of real-time data. RTP can be used over both unicast and multicast network services. Over a unicast network service, we send separate copies of the data from the source to each destination, whereas over a multicast network service, the data is sent from the source only once and the network is responsible for transmitting the data to multiple locations. This last case is more efficient for many multimedia applications, such as video conferences. IP supports multicasting.

3.2.2 RTP Services

RTP enables one to identify the type of data being transmitted, determine what order the packets of data should be presented in, and synchronize media streams from different sources. As I mentioned above, RTP data packets are not guaranteed to arrive in the order they were sent; they are in fact not guaranteed to arrive at all. It is up to the receiver to reconstruct the sender's packet sequence and detect lost packets using the information provided in the packet header.
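The receiver-side bookkeeping described above can be sketched in Java as follows. This is a simplified illustration and not the JMF or RTP implementation: the class and method names are hypothetical, packets are reduced to bare sequence numbers, and the wrap-around of real 16-bit RTP sequence numbers is deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simplified sketch: restore the sender's order, drop duplicates, and report gaps.
// Assumption: sequence numbers do not wrap around within the observed window.
final class SequenceReorder {

    /** Returns the distinct sequence numbers in ascending (sender) order. */
    static List<Integer> inOrder(int[] arrived) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int seq : arrived) {
            seen.add(seq); // a TreeSet both sorts and discards duplicates
        }
        return new ArrayList<>(seen);
    }

    /** Returns the sequence numbers missing between the lowest and highest one received. */
    static List<Integer> missing(int[] arrived) {
        List<Integer> sorted = inOrder(arrived);
        List<Integer> lost = new ArrayList<>();
        for (int k = 1; k < sorted.size(); k++) {
            for (int seq = sorted.get(k - 1) + 1; seq < sorted.get(k); seq++) {
                lost.add(seq);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        int[] arrived = {3, 1, 2, 2, 6, 5}; // out of order, one duplicate, one loss
        System.out.println("in order: " + inOrder(arrived));
        System.out.println("missing:  " + missing(arrived));
    }
}
```

A real receiver would do this incrementally inside a bounded jitter buffer rather than on a complete array, since a streaming application cannot wait indefinitely for a late packet.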

While RTP does not provide any mechanism to ensure timely delivery or provide other quality of service (QoS) guarantees, it is augmented by a control protocol (RTCP) that enables you to monitor the quality of the data distribution. RTCP also provides control and identification mechanisms for RTP transmissions.

Now, what if the quality of service is essential for a particular application? The solution is simply to use RTP over a resource reservation protocol that provides connection-oriented services.

An RTP session is an association among a set of applications communicating with RTP. A session is identified by a network address and a pair of ports. One port is used for the media data and the other is used for control (RTCP) data.

A participant is a single machine, host, or user participating in the session. Participation in a session can consist of passive reception of data (receiver), active transmission of data (sender), or both.

Each media type is transmitted in a different session. For example, if both audio and video are used in a conference, one session is used to transmit the audio data and a separate session is used to transmit the video data. This enables participants to choose which media types they want to receive. For example, someone who has a low-bandwidth network connection might only want to receive the audio portion of a conference.

3.3 Compression

As described in the previous section, data is transferred over the network as bits. Sometimes, however, the application program needs to send more data in a timely fashion than the bandwidth of the network supports. This problem leads to compression. Say, for example, that a video application has a 10-Mbps video stream that it wants to transmit, but it has only a 1-Mbps network available to it. We can then compress the data at the sender, transmit them over the network, and finally decompress the data at the receiver.

This field has a rich history, dating back to Shannon's pioneering work on information theory in the 1940s, and in a nutshell data compression is concerned with removing redundancy from the encoding. In this section I will briefly present the most common compression techniques and standards for encoding images and video, and introduce some of the most widely used terms. When it comes to JPEG and MPEG coding, I will look a little closer at some details because these are some of the most popular standards.

Generally speaking, there are two classes of compression algorithms: lossless compression and lossy compression. Lossless compression ensures that the data recovered from the compression/decompression process is exactly the same as the original data. Lossy compression, on the other hand, does not promise that the data received is exactly the same as the data sent. This is because the algorithm removes information that cannot later be restored. So why would we like an algorithm that removes information from the chunk of data we would like to transmit over the network? This is because lossless compression algorithms do not reduce the size as much as needed when we, for instance, are transmitting a large picture or video file. The challenge is therefore to design algorithms that take away information that will not be missed by the receiver.

3.3.1 Lossless compression algorithms

I begin by briefly introducing what are probably the three best-known lossless compression algorithms: Run Length Encoding (RLE), Differential Pulse Code Modulation (DPCM) and the dictionary-based methods. The reason for introducing these methods is that they are often used within the lossy compression algorithms.

Run Length Encoding (RLE)

The idea behind this method is to replace consecutive occurrences of a given symbol with only one copy of the symbol, plus a count of how many times the symbol occurs; hence the name run length. For example, the string EERHHJJJJK would be encoded as 2E1R2H4J1K.

RLE can be used to compress digital imagery by comparing adjacent pixel values and then encoding only the changes. This is quite effective for images that have large homogeneous regions. RLE is the key compression algorithm used to transmit faxes [2].
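The run-length idea can be written down in a few lines of Java. The sketch below uses a hypothetical class name and the simple count-then-symbol text format of the example above; it assumes that the input contains no digits, since otherwise the output would be ambiguous. Real fax and image coders use binary formats instead.

```java
// Minimal run-length encoder for text, in the count-then-symbol form of the example above.
// Assumption: the input contains no digit characters.
final class RunLength {

    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char symbol = input.charAt(i);
            int run = 1;
            // Extend the run while the following characters equal the current symbol.
            while (i + run < input.length() && input.charAt(i + run) == symbol) {
                run++;
            }
            out.append(run).append(symbol);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("EERHHJJJJK")); // prints 2E1R2H4J1K
    }
}
```

The example also shows the weakness of the method: a string without repeated symbols, such as ABC, grows to 1A1B1C, so RLE only pays off when long runs are common.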

Differential Pulse Code Modulation (DPCM)

For this compression type, the basic idea is first to output a reference symbol, and then, for each symbol in the data, to output the difference between that symbol and the reference symbol. As an illustration, the string AAABBCDDDD would be encoded as A0001123333, since A is the same as the reference symbol, B has a difference of 1 from the reference symbol, and so on. The benefit of this method is that if the difference between the symbols is small, one can encode them with fewer bits than the symbol itself. When the difference between the symbol and its reference symbol becomes too large, a new reference symbol is selected. Using DPCM, compression ratios of 1.5-to-1 have been measured on digital images [2].
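A minimal Java sketch of this scheme, reproducing the string example above, could look as follows. The class name is hypothetical, and the sketch makes two simplifying assumptions that are not part of the description: every difference must fit in a single decimal digit (0 to 9), and the step of selecting a new reference symbol when the difference grows too large is left out.

```java
// Minimal DPCM sketch in the text form of the example above.
// Assumptions: differences are limited to 0..9, and no new reference symbol is ever selected.
final class Dpcm {

    static String encode(String input) {
        if (input.isEmpty()) {
            return "";
        }
        char reference = input.charAt(0);
        StringBuilder out = new StringBuilder().append(reference);
        for (char symbol : input.toCharArray()) {
            int difference = symbol - reference;
            if (difference < 0 || difference > 9) {
                throw new IllegalArgumentException("difference out of range: " + difference);
            }
            out.append(difference); // appended as one decimal digit
        }
        return out.toString();
    }

    static String decode(String encoded) {
        if (encoded.isEmpty()) {
            return "";
        }
        char reference = encoded.charAt(0);
        StringBuilder out = new StringBuilder();
        for (int k = 1; k < encoded.length(); k++) {
            int difference = encoded.charAt(k) - '0';
            out.append((char) (reference + difference));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("AAABBCDDDD");
        System.out.println(encoded);         // prints A0001123333
        System.out.println(decode(encoded)); // prints AAABBCDDDD
    }
}
```

In this text form nothing is saved, of course; the gain appears only when the small differences are stored in fewer bits than the original symbols.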

Dictionary based methods

The idea behind this method is to build a dictionary (table) with references to words and phrases you think might be common in a text. One can, for example, say that a particular word maps to a unique number, where the number will be encoded in far fewer bits than the word.

The question is now where this dictionary comes from. It could either be static, or dynamic, which means that it is transferred at the beginning of the file.

3.3.2 Lossy compression algorithms

The compression ratio of lossless methods is not high enough for image and video compression, especially when the distribution of the pixel values is relatively flat. The next method, a lossy encoding called JPEG, uses something called transform coding, and it is largely based on the following observations:

• A large majority of useful image contents change relatively slowly across images, i.e., it is unusual for intensity values to alter up and down several times in a small area, for example within an 8 x 8 block. If we translate this into the spatial frequency domain, it says that, generally, lower spatial frequency components contain more information than the high frequency components, which often correspond to less useful detail and noise.

• Psychophysical experiments suggest that humans are less receptive to the loss of higher spatial frequency components than to the loss of lower frequency components.

Image Compression (JPEG)

JPEG (Joint Photographic Experts Group) compression takes place in three different stages, as shown in Figure 3.3.

Figure 3.3: The three main steps in the JPEG encoding. (Source image → DCT → Quantization → Encoding → Compressed image.)

When the image is to be compressed, it is fed through these three phases one 8 × 8 block at a time.

The first phase applies a Discrete Cosine Transform (DCT) to the block. This transforms the image from the spatial domain into the spatial frequency domain. The DCT is chosen instead of the Fast Fourier Transform (FFT) because it can approximate linear signals well with fewer coefficients. We apply the DCT to separate the gross features from the fine details.

The DCT, along with its inverse, which is performed during decompression, is defined by the following formulas:

$$\mathrm{DCT}(i,j) = \frac{1}{\sqrt{2N}}\, C(i)\, C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \mathrm{pixel}(x,y) \cos\left[\frac{(2x+1)\, i\pi}{2N}\right] \cos\left[\frac{(2y+1)\, j\pi}{2N}\right]$$

$$\mathrm{pixel}(x,y) = \frac{1}{\sqrt{2N}} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} C(i)\, C(j)\, \mathrm{DCT}(i,j) \cos\left[\frac{(2x+1)\, i\pi}{2N}\right] \cos\left[\frac{(2y+1)\, j\pi}{2N}\right]$$

$$C(x) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } x = 0, \\ 1 & \text{if } x > 0 \end{cases}$$
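
As an illustration, the two transforms can be written directly from their definitions. The sketch below is not an optimized or standard-conforming JPEG implementation. It assumes N = 8 (for which the factor 1/sqrt(2N) equals 1/4), uses floating-point instead of fixed-point arithmetic, and the class name DctSketch is hypothetical.

```java
public class DctSketch {

    static final int N = 8; // block size assumed throughout this sketch

    /** The normalization factor C(k): 1/sqrt(2) for k = 0, otherwise 1. */
    static double c(int k) {
        return k == 0 ? 1.0 / Math.sqrt(2.0) : 1.0;
    }

    /** The cosine basis term cos((2*pos + 1) * freq * pi / (2N)). */
    static double basis(int pos, int freq) {
        return Math.cos((2 * pos + 1) * freq * Math.PI / (2.0 * N));
    }

    /** Forward DCT: spatial domain to spatial frequency domain. */
    public static double[][] forward(double[][] pixel) {
        double scale = 1.0 / Math.sqrt(2.0 * N);
        double[][] dct = new double[N][N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int x = 0; x < N; x++) {
                    for (int y = 0; y < N; y++) {
                        sum += pixel[x][y] * basis(x, i) * basis(y, j);
                    }
                }
                dct[i][j] = scale * c(i) * c(j) * sum;
            }
        }
        return dct;
    }

    /** Inverse DCT: spatial frequency domain back to the spatial domain. */
    public static double[][] inverse(double[][] dct) {
        double scale = 1.0 / Math.sqrt(2.0 * N);
        double[][] pixel = new double[N][N];
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) {
                double sum = 0.0;
                for (int i = 0; i < N; i++) {
                    for (int j = 0; j < N; j++) {
                        sum += c(i) * c(j) * dct[i][j] * basis(x, i) * basis(y, j);
                    }
                }
                pixel[x][y] = scale * sum;
            }
        }
        return pixel;
    }
}
```

For a block where every pixel has the same value, only the coefficient at position (0, 0) is non-zero. This is the separation of gross features from fine details in its simplest form.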

There is, of course, some loss of precision during the DCT due to the use of fixed-point arithmetic, but it is in the quantization phase that the compression really becomes lossy. In this phase one simply drops the least significant bits of the frequency coefficients.

In the final phase, the encoding phase, we start at position (0, 0) in the output matrix from the DCT phase and process the coefficients in a zigzag sequence as shown in Figure 3.4.

Figure 3.4: Zigzag encoding used in JPEG.
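
The zigzag traversal can be generated by walking the anti-diagonals of the block and alternating the direction on each one. The sketch below assumes the conventional JPEG ordering, which starts (0,0), (0,1), (1,0), (2,0), with positions given as (row, column). The class name ZigzagSketch is hypothetical.

```java
public class ZigzagSketch {

    /** Returns the visiting order as {row, column} pairs for an n x n block. */
    public static int[][] order(int n) {
        int[][] visit = new int[n * n][2];
        int k = 0;
        // s = row + column is constant along each anti-diagonal.
        for (int s = 0; s <= 2 * (n - 1); s++) {
            int lo = Math.max(0, s - (n - 1)); // smallest valid row on this diagonal
            int hi = Math.min(s, n - 1);       // largest valid row on this diagonal
            if (s % 2 == 0) {
                // Even diagonals run from bottom-left to top-right.
                for (int row = hi; row >= lo; row--) {
                    visit[k][0] = row;
                    visit[k][1] = s - row;
                    k++;
                }
            } else {
                // Odd diagonals run from top-right to bottom-left.
                for (int row = lo; row <= hi; row++) {
                    visit[k][0] = row;
                    visit[k][1] = s - row;
                    k++;
                }
            }
        }
        return visit;
    }

    /** Flattens a square block of coefficients into zigzag order. */
    public static double[] scan(double[][] block) {
        int[][] visit = order(block.length);
        double[] out = new double[visit.length];
        for (int k = 0; k < visit.length; k++) {
            out[k] = block[visit[k][0]][visit[k][1]];
        }
        return out;
    }
}
```

The point of this ordering is that the low-frequency coefficients come first, and the high-frequency ones, which are often zero after quantization, end up together at the tail.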

Along the zigzag, a form of RLE (Run Length Encoding) is used. Finally, the individual coefficients are encoded using a Huffman code (a minimal variable-length encoding based on the frequency of each character). Because of the different characteristics of different pictures it is impossible to say exactly what the compression ratio is, but a widely accepted generalization says that JPEG is able to compress a 24-bit color image by a ratio of roughly 30-to-1.

Video Compression (MPEG)

If we make an approximation, we can say that a moving picture (i.e. video) is just a succession of still images (frames) displayed at some video rate. Each frame could be compressed using the same technique used in JPEG. However, with this strategy we do not exploit the redundancy present between consecutive frames. MPEG (Moving Picture Experts Group) takes this interframe redundancy into consideration.

MPEG takes as input a stream of video frames, and compresses them into three different frame types: I frames (intrapicture), P frames (predicted picture) and B frames (bidirectional predicted picture).

I frames are also called reference frames. They are the JPEG compressed version of the corresponding frame in the video source. The I frames are self-contained, as opposed to the P and B frames. To be more precise, the P frames specify the difference from the previous I frame, while a B frame gives an interpolation between the previous and subsequent I or P frames. As an illustration, Figure 3.5 shows an example found in [2] of how I, P and B frames are generated by MPEG. The seven frames in the input stream result in the specified output stream when compressed by MPEG.

Input stream:   Frame 1   Frame 2   Frame 3   Frame 4   Frame 5   Frame 6   Frame 7
Output stream:  I frame   B frame   B frame   P frame   B frame   B frame   I frame

Figure 3.5: Sequence of I, P, and B frames generated by MPEG.

As mentioned above, the two I frames are self-contained, and can be decompressed at the receiver end independently of the other frames. The P frame, on the contrary, depends on the preceding I frame, and cannot be decoded at the receiver end if the preceding I frame is lost.

In the case of the B frames, the situation is more complicated. These frames depend on both the preceding I or P frame and the subsequent I or P frame. This means that both of these reference frames must arrive at the receiver before MPEG can decode the B frame to reproduce the original frame. That some frames are needed to be able to decode others implies that the compressed frames cannot be transmitted in sequential order. The sequence of compressed frames in Figure 3.5, IBBPBBI, will therefore be transmitted as IPBBIBB.
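
The reordering can be illustrated with a short Java sketch. It assumes that frames are given as a string of the letters I, P and B in display order, and that every B frame refers to the nearest reference frame on each side. The class name FrameReorderSketch is hypothetical.

```java
public class FrameReorderSketch {

    /** Moves each reference frame (I or P) ahead of the B frames that depend on it. */
    public static String transmissionOrder(String displayOrder) {
        StringBuilder out = new StringBuilder();
        StringBuilder pendingB = new StringBuilder(); // B frames waiting for their future reference
        for (char type : displayOrder.toCharArray()) {
            if (type == 'B') {
                pendingB.append(type);
            } else {
                // Send the reference frame first, then the B frames it completes.
                out.append(type).append(pendingB);
                pendingB.setLength(0);
            }
        }
        out.append(pendingB); // trailing B frames without a future reference, if any
        return out.toString();
    }
}
```

With the sequence from Figure 3.5, transmissionOrder("IBBPBBI") yields IPBBIBB.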

It is important to note that there is no defined ratio of I frames to P and B frames, as one might expect when looking at the figure. This ratio varies, depending on the picture quality and the compression.

MPEG coding is a very expensive and rather time-consuming task. This is why the encoding is normally done off-line, and not in real time. The data is therefore encoded and stored on disk ahead of time, and can then be transmitted. This may change in the future as processor speeds increase and the algorithms get faster. I will now take a closer look at the three different frame types.

As already mentioned, the I frames are approximately equal to the JPEG compressed version of the source frame. However, there is one main difference. MPEG works in units of 16 × 16 macro blocks. Let us say that a color video is represented by the YUV representation, where the first component, Y, represents the luminance, and the last two components, U and V, represent chrominance. Then the U and V components in each macro block are downsampled into an 8 × 8 block. The reason why this can be done is that the U and V components can be transmitted with less accuracy, because humans are less sensitive to color than they are to brightness. Figure 3.6 shows the relationship between the frame and the corresponding macro blocks.

The figure visualizes the downsampling of the U and V components into 8 × 8 blocks. Each of the 2 × 2 sub blocks in the macro block is given by one U value and one V value, the average of the four pixels. We still have four Y values in the sub block.
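
The averaging of each 2 × 2 sub block can be sketched as follows. The method takes one chrominance plane (U or V) of a macro block and returns the plane at half resolution in both directions. The class name ChromaDownsampleSketch and the use of double values are assumptions made for the illustration.

```java
public class ChromaDownsampleSketch {

    /** Replaces every 2 x 2 sub block by the average of its four values. */
    public static double[][] downsample(double[][] chroma) {
        int rows = chroma.length / 2;
        int cols = chroma[0].length / 2;
        double[][] out = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double sum = chroma[2 * r][2 * c]
                           + chroma[2 * r][2 * c + 1]
                           + chroma[2 * r + 1][2 * c]
                           + chroma[2 * r + 1][2 * c + 1];
                out[r][c] = sum / 4.0;
            }
        }
        return out;
    }
}
```

A 16 × 16 macro block then carries 256 Y values but only 64 U and 64 V values, so the chrominance data is reduced to a quarter.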

The macro block processing is not specific to the I frames; this principle is used for both the P and B frames as well. Loosely speaking, one can describe the P and B frames as carriers of information about the motion in the video. For each macro block they carry information about in what direction and how far the macro block moved relative to the reference frame(s).

Taking a closer look at the encoding of the B frame, it is important to note that the macro blocks in the B frame do not necessarily depend on both an earlier and a later frame as suggested above. A macro block may be specified relative to just one or the other. It is in fact possible for a given macro block in a B frame to use the same intracoding as is used in an I frame. This flexibility exists because if there is a rapid change in the motion picture, it sometimes makes sense to give the intrapicture encoding rather than a forward- or backward-predicted encoding.

Figure 3.6: Each frame as a collection of macro blocks. (A 16 × 16 pixel region of the color frame gives a 16 × 16 macro block with Y components, an 8 × 8 macro block with U components, and an 8 × 8 macro block with V components.)

From now on I assume that the macro block uses bidirectional predictive encoding, that is, it depends on both an earlier and a later frame. The macro block is then represented in the following way:

• a coordinate for the macro block in the frame.

• a motion vector relative to the previous reference frame.

• a motion vector relative to the subsequent reference frame.

• a $\delta$ for each pixel in the macro block, indicating how much each pixel has changed relative to the two reference pixels.

I let $F_p$ and $F_f$ denote the past and future reference frames, respectively, and $(x_p, y_p)$ and $(x_f, y_f)$ the past and future motion vectors. To find the pixels in the macro block, we need to find the corresponding reference pixels in the past and future reference frames for every single one of them. This is done by using the two motion vectors associated with the macro block. After the average of the two reference pixels is computed, the $\delta$ for the pixel is added. If $(x, y)$ are the coordinates of a particular pixel in the current frame, $F_c$, this can be stated mathematically as

$$F_c(x, y) = \frac{F_p(x + x_p,\, y + y_p) + F_f(x + x_f,\, y + y_f)}{2} + \delta(x, y) \tag{3.4}$$
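
Equation (3.4) translates almost literally into code. In the sketch below the frames are assumed to be stored as two-dimensional arrays indexed as frame[x][y], and no bounds checking is done on the displaced coordinates. The class and parameter names are hypothetical.

```java
public class BidirectionalPredictionSketch {

    /**
     * Reconstructs one pixel of the current frame from the past and future
     * reference frames, the two motion vectors, and the per-pixel delta.
     */
    public static double reconstruct(double[][] past, double[][] future,
                                     int x, int y,
                                     int xp, int yp,
                                     int xf, int yf,
                                     double delta) {
        double pastRef = past[x + xp][y + yp];       // F_p(x + x_p, y + y_p)
        double futureRef = future[x + xf][y + yf];   // F_f(x + x_f, y + y_f)
        return (pastRef + futureRef) / 2.0 + delta;  // average plus correction
    }
}
```

In a decoder this computation would be repeated for every pixel of the macro block, with the same two motion vectors but an individual delta for each pixel.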

The $\delta$'s are encoded by DCT and then quantized. The P frames are handled in the same manner, but now the frames depend on only one reference frame.

3.3.3 The Java Media Framework

The Java Media Framework (JMF) is a new API, released about two years ago, which was developed to extend the Java language to meet the increased demands for multimedia. The early features of Java only hinted at the possibilities: the audio capabilities of applets were limited to a single format, the AU format², there was no support for video, and animation was limited to series of GIF (Graphics Interchange Format) images. It was not until the introduction of the Java Media Application Programming Interfaces (Java Media APIs³), of which the Java Media Framework is one, that Java began to meet the increasing demands for multimedia. The Java Media APIs support the integration of a wide range of audio and video formats. Advanced imaging, animation, two- and three-dimensional graphics and modeling, as well as speech and telephony support, are features brought to Java applications and applets. The JMF part, which I have been concerned with in this thesis, consists of a suite of three APIs. These APIs are designed for the capture and playback of audio and video. The first of these three, which is the one that I have been working on, is the JMF Player API, describing audio and video playback. The other two APIs describe video and audio capture capabilities and videoconferencing capabilities.

As with pure Java, an application written to the JMF Player API is capable of operating on any Java platform that supports a conforming implementation of the Player API. The API is still not integrated into today's web browsers, so to be able to view an applet written to the JMF Player API on the Internet, the user has to install the package on his/her computer.

Another feature, and this is the main reason why FAST wanted a multimedia streaming application in Java, is that no software installation is needed. The user does not have to install different software players to play back different file formats; it is all taken care of by the applet downloaded in conjunction with the web page. The JMF Player API is independent of any particular data type. The Player implementation is written to support a particular type of media data, and is therefore easily integrated with the installed JMF, or, soon, JMF integrated in the web browsers. Some of the audio formats and video codecs supported by the current implementation of the JMF are: AIFF, AVI, MIDI, MPEG-1 video, MPEG Layer II Audio, MPEG [...]. JMF has also got support for common protocols, which means that a JMF player can obtain media data using the FILE, FTP (File Transfer Protocol), HTTP (HyperText Transfer Protocol) and RTP (Real-time Transport Protocol) protocols.

It is also possible to extend the player to support other media formats, which was the objective of the cooperation with FAST, and the API describes a simple mechanism to synchronize the playback of multiple media sources.

² These files use µ-law encoding, a logarithmic 8-bit coding scheme.

³ The API is just a collection of classes that provides capabilities in areas such as audio/video playback in Java applications and applets.

3.3.4 The Java Native Interface

The Java Native Interface (JNI) Application Programming Interface (API) was introduced in release 1.1 of the Java Development Kit (JDK) and is a tool for integrating Java with native languages such as C and C++, that is, making them "talk" to each other. There are several reasons why we want to do this. One reason may be that we need to perform some heavy computations that are better taken care of by C or C++, or, as in my case, we have an existing C++ class library we want to use without having to rewrite the whole library in Java.

The drawback is that we have to leave the confines of the Java Virtual Machine (JVM). The JVM executes the Java bytecode and performs system-specific operations on behalf of a Java application, and it is the virtual machine that makes it possible to "write once and run anywhere".

A Java application that uses native methods depends on the execution of code running directly on the host hardware. Java code is compiled to Java bytecode and is executed by the JVM. The JVM acts as an intermediary between the bytecode and the host hardware. Native code, on the other hand, can be thought of as running outside the JVM. It is platform-specific object code.

So, where does this native code exist? Most typically, in a dynamically loaded library (DLL) that was built using the host machine's native compiler and linker. The contents of this library are platform dependent because the object code generated by the native compiler is processor-specific. Beyond that, there may be all sorts of code in the library that references platform-specific devices or an application-specific database interface. In any case, the dynamically loadable library contains binary objects.

Once the library containing binary objects is built for a specific platform Y and adheres to the JNI API, that library can be used, without rebuilding, with any JVM that supports JNI on platform Y. This means that if a vendor A and a vendor B both provide a JVM with JNI support on machine Y, then any library using only JNI calls to interface with the JVM will work with either A's or B's JVM.


Low-rate Speech Coding

Speech coding or speech compression are two names of the field concerned with obtaining compact digital representations of voice signals for the purpose of efficient transmission or storage. During the last decade one has witnessed substantial progress towards applications of low-rate speech coders both to civilian and military communications, not to mention the computer-related voice applications.

What has made this progress possible is the development of new speech coders capable of producing high-quality speech at low data rates. As optical fiber bandwidth in wired communication has become inexpensive, one could ask why we are so concerned about the preservation of low data rates. The answer is that there is a growing need for bandwidth conservation in wireless cellular and satellite communications. There are also voice-related applications designed for computers (like voice mail), and most of these applications require the speech signal in digital format, so that it can be processed, stored, or transmitted under software control.

The digitalization gives us advantages in the form of flexibility and the opportunities for encryption, but at the same time, when uncompressed, it is associated with a high data rate and therefore high requirements of transmission bandwidth and storage. Speech coding includes both sampling and amplitude quantization, and the objective is to represent speech with a minimum number of bits while maintaining its perceptual quality.

In this chapter I will describe some of the earlier coding techniques, and the development which leads to the technique I will be concerned with in this thesis, namely the MPEG-4 CELP algorithm, a highly effective and flexible coding technique used for low-bitrate speech coding. I will also give a brief introduction to the basic properties of speech and some historical perspectives.

Since I will be talking about discrete time signals, I will adopt the notation used in most textbooks concerned with discrete time signal processing. I will relate discrete time speech, $s(n)$, to analog speech, $s_a(t)$, by the relation

$$s(n) = s_a(nT) = s_a(t)\big|_{t=nT} \tag{4.1}$$

where $T$ is the sampling period. Another common notation is to let lower-case symbols denote time-domain signals, and to let upper-case symbols denote transform-domain signals, unless otherwise stated. When I refer to matrices and vectors, I will denote them by bold letters.

4.2 Basic Speech Properties and Historical Overview

Speech coding research started more than fifty years ago, with the motivation of transmitting speech over low-bandwidth telegraph cables. In the earliest voice coders, or vocoders as I will refer to them throughout this thesis, the idea was to analyze speech in terms of pitch and spectrum and synthesize it by exciting a bank of ten analog band-pass filters, which represented the vocal tract, with periodic or random excitation. These periodic and random excitations represented voiced and unvoiced sounds respectively.

As illustrated, vocoders exploit the properties of speech. One of the problems with speech signals is that they are non-stationary. By a stationary process I mean a random process $x(n)$ where the following conditions are satisfied [7]:

1. The mean of the process is a constant, $m_x(n) = m_x$

2. The autocorrelation $r_x(k, l)$ depends only on the difference, $k - l$

3. The variance of the process is finite, $c_x(0) < \infty$

That speech is non-stationary implies that we cannot analyze the signal as a whole with the conventional Fourier transform. What we can do is to consider the signals as quasi-stationary over short time segments, typically 15-20 ms.

Speech is typically classified as voiced (e.g. /a/, /i/, etc.), unvoiced (e.g. /sh/), or a combination of the two. Both have certain characteristics: voiced speech is quasi-periodic in the time domain and harmonically structured in the frequency domain, while unvoiced speech, on the other hand, is random-like and broadband. In addition, voiced segments typically have more energy than the unvoiced segments. Some examples are shown in Figures 4.1 and 4.2.

Figure 4.1: The periodicity obtained when pronouncing the voiced sound /a/.

Figure 4.2: The random-like time domain plot of the pronunciation of the unvoiced sound /sh/.

From these plots, we clearly see the quasi-periodicity in Figure 4.1, where I have plotted a pronunciation of the vowel a. In Figure 4.2, on the contrary, we see the random-like time domain plot of a pronunciation of the fricative (or unvoiced sound) sh.

The corresponding frequency domain plots of the /a/ and /sh/ sounds can be seen in Figures 4.3 and 4.4, respectively, and they both correspond to what I mentioned above.

Figure 4.4: The broadband spectrum plot of the pronunciation of the unvoiced sound /sh/.

One characterizes the short-time power spectrum (I will refer to this simply as the spectrum unless otherwise stated) by its fine and formant structure. The fine harmonic structure is represented by the peaks in the spectrum, and it is a consequence of the quasi-periodicity of the speech. It may be attributed to the vibrating vocal cords. We see from Figure 4.3 that there are several peaks in the spectral envelope. These peaks are called formants, and for the average vocal tract there are between three and five formants below 5 kHz. The location and height of the formants give us important information both in speech synthesis and perception, and wideband and unvoiced speech representations. We know that peaks in the spectrum indicate periodicity in the time domain representation of the signal. Conversely, if the time domain representation consists of a single peak, we will get a flat spectrum. [...] examples, we will have a combination of the two. The spectral envelope, that is, the formant structure, is due to the interaction of the source and the vocal tract. When I refer to the vocal tract, I mean the pharynx and the mouth cavity.

Before I start describing the different coding techniques I will relate the properties of speech to the physical speech production system. We can classify speech sounds into three distinct classes [9] according to their mode of excitation. Starting with voiced speech: the vocal cords vibrate, generating quasi-periodic glottal air pulses that excite the vocal tract, which in turn produces the voiced speech. Unvoiced speech, or fricatives, on the other hand, is generated by forcing air through a constriction in the vocal tract. Plosive sounds, like /k/, are due to abruptly releasing air pressure which was built up behind a closure in the tract. A time domain plot and a frequency domain plot of the plosive sound /p/ are shown in Figures 4.5 and 4.6.

Figure 4.5: The time domain plot of the plosive sound /p/.

The sub-glottal system is composed of the lungs, bronchi and trachea. It is the source of the energy for the production of speech, which is simply the acoustic wave that is radiated from this system when air is expelled from the lungs and the resulting flow of air is perturbed by a constriction somewhere in the vocal tract.

4.3 Different Types of Vocoders

There are both parametric and non-parametric coding techniques. The simplest non-parametric coding technique is Pulse Code Modulation (PCM). This is simply a quantization of the sampled amplitudes. It is clear that this technique alone will not give us the low data rates we are seeking.

To obtain speech coding at medium rates and below, we need to use an analysis-synthesis technique. This is a parametric approach. Briefly described, the speech is, in the analysis stage, represented by a compact set of parameters which are encoded efficiently. Then, in the synthesis stage, these parameters are decoded and used in conjunction with a reconstruction mechanism to form speech. This is the idea that forms the basis of the CELP algorithm, but CELP uses analysis-by-synthesis in the analysis stage. That is, the parameters are extracted and encoded by explicitly minimizing a measure of the difference between the original and the reconstructed speech. The measure is usually a mean square measure.

I mentioned the PCM method as a straightforward method for discrete time, discrete amplitude approximation of analog waveforms, but it does not take into account the redundancy in the signal. As digital computers became more powerful, and with the flexibility they offered, one could experiment with more sophisticated digital representations of speech.

The development has gone through several stages, from the simple speech source-system production model depicted in Figure 4.7, where a slowly time-varying system is excited by a periodic impulse train for voiced speech, and by random excitation for unvoiced speech, to the more complex linear prediction algorithm with stochastic vector excitation called "Code Excited Linear Prediction". The vocal tract filter in Figure 4.7 is an all-pole filter, an AR-model. The parameters are obtained by Linear Prediction. This is a process where the present speech sample is predicted by a linear combination of previous samples. I will return to AR-models and Linear Prediction in a later chapter.

Figure 4.7: The engineering model for speech synthesis. (Blocks shown: gain, vocal tract filter, synthetic speech output.)

During the last decades, techniques like the Short Time Fourier Transform (STFT), transform coding, and sub-band coding have also been exploited in the analysis-synthesis process. CELP is also based on LPC, but it also uses an efficient technique to encode the LPC parameters, namely vector quantization.

4.4 How to Measure the Performance

Before one can rank the different algorithms and techniques, one needs to know how the performance is measured. The typical way to evaluate the different speech coding algorithms is based on the bitrate, the quality of the reconstructed speech, the complexity of the algorithm, the delay introduced, and the robustness of the algorithm to channel errors and acoustic interference.

These are the criteria one has in mind, and the next important task is how to gauge the speech quality. This is not a simple task. One of the most common objective measures, not only in speech processing algorithms, but in all signal processing algorithms, is the signal-to-noise ratio (SNR), defined by the following equation [10]

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_{n=0}^{M} s^2(n)}{\sum_{n=0}^{M} \left( s(n) - \hat{s}(n) \right)^2} \right), \tag{4.2}$$

where $M$ is the number of samples, $s(n)$ is the original speech data, while $\hat{s}(n)$ is the coded speech data.

This tends to be a satisfying measure, but there is a minor problem: this is a long-term measure, so it will hide temporal reconstruction noise. These temporal variations can be better measured using a short-time signal-to-noise ratio, that is, we compute the SNR for each $N$-point segment of speech. The segmental SNR (SEGSNR) is given by

$$\mathrm{SEGSNR} = \frac{10}{L} \sum_{i=0}^{L-1} \log_{10} \left( \frac{\sum_{n=0}^{N-1} s^2(iN + n)}{\sum_{n=0}^{N-1} \left( s(iN + n) - \hat{s}(iN + n) \right)^2} \right) \tag{4.3}$$

$L$ is the number of $N$-point sequences we have split the signal into. What we should note here is that the averaging occurs after the logarithm. This implies that coders with varying performance will be penalized more.

The SNR is the most common measure, but not the only one. Other methods mentioned in the literature are the articulation index, the log spectral distance (a number that allows two measurement vectors to be compared), and the Euclidean distance (the square root of the sum of the squares of the entries of a vector).
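
The two SNR measures in equations (4.2) and (4.3) can be sketched in Java as shown below. The sketch assumes that the original and the coded speech are available as arrays of equal length, sums over all array elements, and ignores any samples left over after the last complete segment. The class name SnrSketch is hypothetical. A segment that is reconstructed perfectly has zero noise energy and gives an infinite value, which a practical implementation would have to handle.

```java
public class SnrSketch {

    /** 10*log10 of signal energy over noise energy for samples from..to-1. */
    static double segmentDb(double[] s, double[] sHat, int from, int to) {
        double signal = 0.0;
        double noise = 0.0;
        for (int n = from; n < to; n++) {
            double err = s[n] - sHat[n];
            signal += s[n] * s[n];
            noise += err * err;
        }
        return 10.0 * Math.log10(signal / noise);
    }

    /** Long-term SNR over the whole signal, as in equation (4.2). */
    public static double snr(double[] s, double[] sHat) {
        return segmentDb(s, sHat, 0, s.length);
    }

    /** Segmental SNR over L complete segments of segmentLength samples, as in equation (4.3). */
    public static double segSnr(double[] s, double[] sHat, int segmentLength) {
        int segments = s.length / segmentLength; // L in equation (4.3)
        double sum = 0.0;
        for (int i = 0; i < segments; i++) {
            sum += segmentDb(s, sHat, i * segmentLength, (i + 1) * segmentLength);
        }
        return sum / segments; // the averaging happens after the logarithm
    }
}
```

Since the average is taken over values in decibels, a single badly reconstructed segment lowers the SEGSNR more than it lowers the long-term SNR.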

4.5 Physical Aspects of Speech Modeling

The simplest physical configuration that has a useful interpretation in terms of speech production is depicted in Figure 4.8 and was found in [9].

Figure 4.8: A simplified vocal tract.

Here, the vocal tract is modeled as a tube of non-uniform, time-varying cross-section. As already mentioned, sound is almost synonymous with vibration. The sound waves are created by vibration and they propagate through air. So, what is fundamental in describing this generation and propagation of sound in the vocal system is the laws of physics, and in particular the laws of conservation of mass, conservation of momentum and conservation of energy, just to mention a few.

Describing the motion of air in the vocal system leads to a set of partial differential equations whose formulation and solution are extremely difficult unless we make some very simple assumptions about vocal tract shape and energy losses in the vocal system. This is what we do in Figure 4.8. We assume that the waves are plane waves (see Section 2.2.4) propagating along the axis of the tube. This assumption holds for frequencies below about 4000 Hz, that is, when the wavelengths are long compared to the dimensions of the vocal tract. If we in addition assume that there are no losses due to viscosity or thermal conduction, and further assume the laws of conservation of mass, momentum and energy, it has been shown that the sound waves in the tube satisfy the following equations:

$$-\frac{\partial p}{\partial x} = \rho\, \frac{\partial (u/A)}{\partial t} \tag{4.4}$$

$$-\frac{\partial u}{\partial x} = \frac{1}{\rho c^2}\, \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t} \tag{4.5}$$

Here $p = p(x, t)$ is the sound pressure, $u = u(x, t)$ is the volume velocity, $A = A(x, t)$ is the cross-sectional area of the tube, $\rho$ is the density of air, and $c$ is the velocity of sound.

4.6 Auto Regressive Moving Average Models

When I later describe the theory behind linear predictive coding (LPC) and code excited linear prediction (CELP)¹, both central in the discussion of the MPEG-4 coder, one will discover that these techniques are based on signal modeling using so-called auto regressive models (AR-models). AR-models are a special case of the more general auto regressive moving average models (ARMA-models).

4.6.1 ARMA-models

The idea is to design a filter that, excited by a special sequence $v(n)$, will generate an approximation, $\hat{x}(n)$, of our original signal $x(n)$. The filters we use for this purpose are causal, linear, shift-invariant filters having a rational system function with $p$ poles and $q$ zeros. These are called ARMA-models, and are defined by the system function

$$H(z) = \frac{B_q(z)}{A_p(z)} = \frac{\sum_{k=0}^{q} b_q(k)\, z^{-k}}{1 + \sum_{k=1}^{p} a_p(k)\, z^{-k}} \tag{4.6}$$
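
In the time domain, the system function (4.6) corresponds to the difference equation x(n) = sum over k = 0..q of b_q(k) v(n-k), minus the sum over k = 1..p of a_p(k) x(n-k). The following Java sketch applies this recursion to an input sequence. It assumes zero initial conditions, and the coefficient a_p(k) is stored at index k - 1 of the array a. The class name ArmaFilterSketch is hypothetical.

```java
public class ArmaFilterSketch {

    /**
     * Filters the excitation v(n) through H(z) = B(z) / A(z).
     * b holds b(0..q); a holds a(1..p) at indices 0..p-1 (the leading 1 is implicit).
     */
    public static double[] filter(double[] b, double[] a, double[] v) {
        double[] x = new double[v.length];
        for (int n = 0; n < v.length; n++) {
            double acc = 0.0;
            // Moving average part: weighted sum of present and past inputs.
            for (int k = 0; k < b.length && k <= n; k++) {
                acc += b[k] * v[n - k];
            }
            // Auto regressive part: weighted sum of past outputs.
            for (int k = 1; k <= a.length && k <= n; k++) {
                acc -= a[k - 1] * x[n - k];
            }
            x[n] = acc;
        }
        return x;
    }
}
```

With b = {1} and a single coefficient a(1) = -0.5, a unit impulse as input gives the output 1, 0.5, 0.25, and so on, which is the impulse response of a simple AR-model. With an empty array a, the filter reduces to a pure moving average model.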

Let $v(n)$ be white noise with power spectrum $P_v(z) = \sigma_v^2$. If we assume that $H(z)$ in equation (4.6) is stable, and filter the sequence $v(n)$ with this filter, the power spectrum of the output process $x(n)$ will be

$$P_x(z) = \sigma_v^2\, \frac{B_q(z)\, B_q^{*}(1/z^{*})}{A_p(z)\, A_p^{*}(1/z^{*})}, \tag{4.7}$$

¹ To be precise, CELP is an extension of LPC.
