
UNIVERSITY OF OSLO Department of Informatics

Crowdsourcing Subjective Quality Assessment of Multimedia Content

The Multimedia Assessment Tool

Master thesis

Ben Christopher Tomlin

15th November 2013


Abstract

Multimedia services are ever-increasing in popularity and are today widely accessed through countless computers and mobile devices. Service providers strive to supply users with a satisfying experience, regardless of the hardware capabilities or network environments the user is faced with. Considering that every user is different, Quality of Experience (QoE) experiments have been developed to assess what is satisfactory. With the growing potential of crowdsourcing, it is becoming more and more feasible to have an Internet crowd conduct subjective assessments on their personal computers rather than in a traditional laboratory. This opens the door to a more diverse set of participants at a lower economic cost. As the main goal is to provide a satisfying end-user experience, there is a strong need for a framework that can measure the quality of multimedia content efficiently and reliably.

In this thesis, we develop and present a crowdsourceable framework for performing subjective quality assessment of multimedia content. While documenting the development process of our framework, we provide a thorough explanation of how it is built and how it works, consequently enabling researchers to run unique experiments according to their needs.

The advantages of this framework compared to the traditional studies conducted in controlled environments are many, but we will also highlight the remaining challenges associated with our approach. Building on a solid theoretical framework, we aim to demonstrate that, with our application, researchers can outsource their experiments within multimedia quality assessment to an Internet crowd without risking the quality of the results. Consequently, while providing reliable evaluation, we obtain a higher level of participant diversity at a much lower cost.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Main Contributions
  1.4 Research Method
  1.5 Outline

2 Background
  2.1 Quality Assessment
    2.1.1 Quality of Service
    2.1.2 Quality of Experience
  2.2 Multimedia
    2.2.1 Encoding and Compression
    2.2.2 Compression Artifacts
  2.3 Crowdsourcing
    2.3.1 Typology
    2.3.2 Benefits
    2.3.3 Challenges
  2.4 Existing Frameworks
    2.4.1 Online Assessment Tools
    2.4.2 Specific Frameworks for Assessment of Multimedia
  2.5 Summary

3 Subjective Evaluation
  3.1 Methodology
    3.1.1 Absolute Category Rating
    3.1.2 Degradation Category Rating
    3.1.3 Pair Comparison
    3.1.4 Single Stimulus Continuous Quality Evaluation
    3.1.5 Quality Evaluation of Long Duration Audiovisual Content
    3.1.6 Comparison
  3.2 Ethical Considerations
  3.3 Statistics
  3.4 Summary

4 Technologies & Frameworks
  4.1 Server-side
    4.1.1 Linux
    4.1.2 Apache
    4.1.3 MySQL
    4.1.4 PHP
  4.2 Client-side
    4.2.1 HTML and CSS
    4.2.2 JavaScript
  4.3 Cross-browser Issues
    4.3.1 Multimedia
  4.4 Alternatives
  4.5 Summary

5 Design & Implementation
  5.1 System Overview
  5.2 Software Architecture
  5.3 File Structure
  5.4 Database Structure
  5.5 Client Composition
    5.5.1 Graphical Design
    5.5.2 Login & Sessions
    5.5.3 Administration
    5.5.4 Experiment Setup
    5.5.5 Conducting Experiments
    5.5.6 Results
  5.6 Implementation
    5.6.1 Apache Configuration
    5.6.2 Dispatcher
    5.6.3 MVC
    5.6.4 Experimentation
    5.6.5 Multimedia Content
    5.6.6 Security & Validation
  5.7 Summary

6 Discussion
  6.1 Motivation
  6.2 Considerations
  6.3 Data Validity
  6.4 The Multimedia Assessment Tool

7 Conclusion
  7.1 Summary
  7.2 Future Work

A Terms and Acronyms

B Source Code


List of Figures

2.1 Digital video processing.
2.2 Compression and available bandwidth [12].
2.3 Global mobile data traffic forecast, 2012-2017 [24].
2.4 Distinct blocking artifacts in an image [12].
2.5 Mosquito noise around edges of objects [12].
2.6 A common crowdsourcing value chain.
2.7 Quadrant of Euphoria's experiment interface under both space-bar states [6].
4.1 The LAMP software architecture.
4.2 Request and delivery of static content.
4.3 Request and delivery of dynamic content.
4.4 Composition of modern web sites and applications.
5.1 Server-client communication over the World Wide Web.
5.2 MVC components and interaction pattern.
5.3 Application file structure.
5.4 Entity-relationship model of the database.
5.5 The interface's front page, including the log-in form.
5.6 The administration section of the application.
5.7 The experimentation setup and control interface.
5.8 ACR test on a video sequence during an experiment.
5.9 Results/statistics of a question from a test experiment.
5.10 Class diagram of super-classes and some example sub-classes.


List of Tables

1.1 A typical rating scale, as used in MOS.
3.1 ACR's recommended five point rating scale.
3.2 DCR's recommended five point impairment scale.
4.1 Video format support across primary browsers.
4.2 Audio format support across primary browsers.
5.1 Descriptions of the database tables used in the application.
5.2 Available question types in the application.
5.3 Multimedia formats with cross-browser support in MAT.
6.1 Scalability-/stress-test of server and application software.


Acknowledgements

I would like to thank my supervisors, Ragnhild Eg and Carsten Griwodz, for their valuable feedback and guidance during the work with this thesis.

Thanks to the guys at the lab as well, for the moral support and a good environment to work in. Finally, I would like to thank my friends, and especially my parents, for their great support throughout my studies.


Chapter 1

Introduction

Technology in this day and age is evolving at a remarkably fast rate. To keep up with current standards, developed software must meet users' preferences and needs. Over the last decade, one platform in particular has emerged as the de facto standard for performing many kinds of wide-audience studies. The traditional paper-and-pen approach to conducting surveys takes valuable time and resources, especially when dealing with multimedia. Evidently, a more efficient and dynamic way of conducting these types of surveys is needed. The Multimedia Assessment Tool is presented as a reliable alternative for conducting user studies on the now highly accessible and diverse platform we call the Internet.

1.1 Background

Multimedia quality is typically approached in one of two manners: either through objective metrics that consider a wide range of measured facts, like signal-to-noise ratios [22, 35], or through subjective measures that are based on the opinions of users [7]. The latter is typically referred to as Quality of Experience [33, 30], and is the method of main interest in this project. While objective metrics are powerful in their consistency, the perception and experience of multimedia quality remains highly subjective. Only human opinion can provide feedback, for instance, on which type of distortion is more distracting or how quality perception may change depending on the video content. The conventional method for collecting subjective opinions on multiple items is through user studies. These studies are often conducted using statistical surveys. Unlike a marketing survey, a statistical survey is aimed at a specific area of research. Typically, surveys provide questions to be assessed according to a range of options, frequently in the form of a scale. A rating scale can take many forms, but they all present a range of response options where one or more have to be selected. The Mean Opinion Score (MOS) rating test is an example of an assessment method which uses a typical five point rating scale [28], as seen in Table 1.1.

Moreover, MOS is one of the more well-known methods for assessing the QoE of multimedia content. Originally, it was used in telephone networks to obtain the user's opinion of the quality of the network. Listeners would sit in a quiet room and score the quality of telephone calls as they perceived it, as explained in detail in the ITU-T Recommendation P.800 [28]. Although it was originally intended for rating the quality of telephone networks, it has become a popular test for assessing quality levels and degradations of other multimedia types. Additional methods are also commonly used in multimedia evaluation studies, which we will discuss later in the thesis.

5 Excellent
4 Good
3 Fair
2 Poor
1 Bad

Table 1.1: A typical rating scale, as used in MOS.
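To make the use of such a scale concrete, the sketch below (our own illustration, written in PHP, the language the tool is built with) computes a MOS and an approximate 95% confidence interval from a set of ratings; the rating values are hypothetical.

```php
<?php
// Computing a Mean Opinion Score from ratings on the 1-5 scale: the MOS
// is simply the arithmetic mean across subjects (ratings are made up).
$ratings = [5, 4, 4, 3, 5, 4, 2, 4];

$n   = count($ratings);
$mos = array_sum($ratings) / $n;

// Sample standard deviation
$var = 0.0;
foreach ($ratings as $r) {
    $var += ($r - $mos) ** 2;
}
$sd = sqrt($var / ($n - 1));

// Approximate 95% confidence interval, using the normal z-value 1.96
$ci = 1.96 * $sd / sqrt($n);

printf("MOS = %.2f +/- %.2f (n = %d)\n", $mos, $ci, $n);
```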

1.2 Problem Definition

As technology has evolved and become more accessible and diverse, the need to evaluate its quality has become ever more imperative. Consequently, multimedia research is not only concerned with optimising solutions, but also with evaluating what is optimal. Commonly, audio and video clips are presented to a group of users in a controlled environment, and participants are then instructed to rate each clip as they go. However, technology has itself provided a new platform for assessment studies.

The traditional survey method can become cumbersome when dealing with multimedia, emphasising the need for a functional and fluent survey method that implements the media it is designed to evaluate.

Moreover, this new platform also reduces the need for the presence of a researcher. Thus, this project aims to develop an online assessment tool for multimedia content where users no longer need to travel to a research facility, but instead can complete the survey when and where they please.

The Multimedia Assessment Tool (MAT) therefore has the potential for reaching out to a larger and more varied group of users, providing the tools and the foundation for thorough research adapted to contemporary technology.

The thesis mainly focuses on developing MAT for assessing multimedia content. MAT should be accessible on all major web browsers and include support for running multiple experiments, or surveys, simultaneously.

As with most other web applications, it is easily accessible for both experimenters and respondents, through a simple, but extensive, point-and-click graphical user interface. The software is contained on a central experimentation server and is able to run on most standard web servers with PHP and MySQL installed. The surveys and all the underlying content, like audio/video clips, questions, instructions, rating scales and so forth, are also stored here, including responses and results gathered from users taking part in the surveys. Furthermore, related research on the topics of online statistical surveys, subjective evaluation and methodology is included. For example, why do we want to evaluate multimedia in the first place? Why is quality such an important factor? Research into these topics is essential for the project, as well as important in understanding how this framework should be developed, and to what purpose.

Following is a list of features and functionalities that are to be included in the implementation.

• Creation of new surveys, selection of multimedia content, invitation and unique logins for users, and definition of instructions prior to the commencement of surveys.

• Creation of questions and response options, specification of the type of question and, if applicable, indication of the number of items on a scale.

• Specification of the order of audio/video clips, number of repetitions, grouping, and randomisation within groups and of the groups themselves (see the ordering sketch after this list).

• Collection of responses, response times, and other relevant technical features.

• Output of response data in a comprehensible format, for example spreadsheets, and some basic statistical analyses (a small export sketch follows below).
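To make the grouping and randomisation requirement concrete, the following sketch shows one way such an ordering could be produced in PHP. The group names and clip file names are purely illustrative, not taken from MAT's actual code.

```php
<?php
// Hypothetical sketch of the ordering feature described in the list above:
// clips are shuffled within each group, and the groups themselves are
// shuffled, before being flattened into the final presentation sequence.
$groups = [
    'low_bitrate'  => ['clip_a.mp4', 'clip_b.mp4', 'clip_c.mp4'],
    'high_bitrate' => ['clip_d.mp4', 'clip_e.mp4', 'clip_f.mp4'],
];

// Randomise the clip order within each group
foreach ($groups as &$clips) {
    shuffle($clips);
}
unset($clips);

// Randomise the order of the groups themselves
$order = array_keys($groups);
shuffle($order);

// Flatten into the sequence shown to one participant
$sequence = [];
foreach ($order as $name) {
    $sequence = array_merge($sequence, $groups[$name]);
}
print_r($sequence);
```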

The final outcome is a distributed tool with a graphical user interface, adapted to run in a web browser, and therefore easily accessible to the vast majority of people.
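As an illustration of the response-data output mentioned in the feature list, the sketch below writes collected responses to a CSV file that any spreadsheet application can open. The field names are assumptions made for the example, not MAT's actual database schema.

```php
<?php
// Illustrative export of collected responses to CSV; column names are
// hypothetical, not taken from MAT's schema.
$responses = [
    ['participant' => 17, 'clip' => 'clip_a.mp4', 'score' => 4, 'response_ms' => 2350],
    ['participant' => 17, 'clip' => 'clip_b.mp4', 'score' => 2, 'response_ms' => 1980],
];

$fh = fopen('results.csv', 'w');
fputcsv($fh, ['participant', 'clip', 'score', 'response_ms']); // header row
foreach ($responses as $row) {
    fputcsv($fh, $row);
}
fclose($fh);
```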

1.3 Main Contributions

The evaluation framework presented in this thesis aims to provide researchers with an extensive and flexible application that can be used for running a variety of subjective assessment studies on multimedia quality, using the Internet platform. With the current lack of similar tools available to the research community, MAT may prove to be a beneficial addition for running assessment studies, both cost- and time-efficiently. These online studies can be conducted from anywhere at any time, greatly benefiting participants, while also allowing a larger and more diverse group of people to take part. Moreover, it should lessen the time and effort required of participants, as well as of the experimenter. Thus, MAT should offer a simple and efficient solution for running quality evaluation studies.

1.4 Research Method

Initially, we performed research into subjective quality assessment, generally related to multimedia. Traditional assessment studies of this kind showed that the methodology and execution of these studies could be both time-consuming and expensive, so an online approach to this problem revealed itself as a promising alternative. Further research was done into online assessment tools that could provide similar features as traditional studies, but few could be found that provided the flexibility we were looking for.

With the Multimedia Assessment Tool, presented in detail in this thesis, we first had to evaluate the necessary requirements and possible applications of the software. The task of designing and implementing it was then undertaken, using technologies such as the LAMP software bundle [36], and the newly updated HTML and CSS standards for building web interfaces [50]. Last, but not least, a thorough discussion of the system as a whole was conducted, examining everything from benefits to possible issues.

1.5 Outline

This thesis is organised as follows. Chapter 1 provides an introduction to the thesis, explaining the background and importance of creating the described framework. Chapter 2 includes background information on relevant topics, including quality assessment, multimedia and crowdsourcing. In Chapter 3, we go into further detail about subjective quality evaluation and methodology, as well as some details on experimentation and ethical considerations. Chapter 4 follows with some important elaboration on the technologies and frameworks that are used in the implementation of the application, while Chapter 5 will discuss in detail the design and implementation of the software itself. In Chapter 6, we will discuss and evaluate the system we have developed, and finally, in the last chapter, we summarise everything with a conclusion.


Chapter 2

Background

An important topic and motivation behind the work of this thesis is what is generally known as quality assessment. The main purpose of the tool we are developing is exactly that: evaluating quality, more precisely assessing multimedia quality. This chapter will go into further detail about the subject of quality assessment, along with touching upon the topics of multimedia and crowdsourcing. These topics all play a central part in this project, and furthermore emphasise the importance and applicability of the assessment tool we are developing.

2.1 Quality Assessment

Across research institutes, in industry and in research in general, people use data in assessment and decision-making. Data-based decision-making is an essential element of continuous quality improvement, and helps individuals and teams to assess the efficiency and effectiveness of current processes [45]. There are several methods for collecting data: focus groups, personal interviews, review of records, counting events, and of course surveys.

Quality assessment in itself can be divided into two related, but different, categories: quality of service (QoS) and quality of experience (QoE). The two mainly differ in how they determine quality: QoS methods measure quality objectively, while QoE methods generally use subjective measures. While QoS may be an important factor in assessing critical parts of a system, QoE is often essential for providing information on how the end-user perceives the overall quality. This will be explained in further detail in the following sections.

2.1.1 Quality of Service

Quality of Service (QoS) is a term that is often used within several aspects of computer science. Originally, it was defined by the International Telecommunication Union (ITU) within the field of telephony [32], but it has played an equally important part in computer networks and similar technology. The QoS concept refers to an objective system performance metric, such as the bandwidth, delay, and loss rate of a communication network. Objective methods can be divided into two categories: signal-based methods and parameter-based methods.

2.1.2 Quality of Experience

Quality of Experience (QoE) is a common term used for defining the quality of a service based on users' own individual opinions. Thus, experiments in QoE are referred to as subjective. This, however, is not always the case, as it is also possible to run objective QoE experiments. This is commonly done by using objective measures to detect or determine quality issues that the human user would perceive as annoying and thus lessen the experience, for example by analysing a multimedia clip and checking if unnatural noise occurs in the processed video segment. Peak signal-to-noise ratio (PSNR) is an example of an objective QoE measure which is often used in quality assessment of multimedia. However, this metric is only conclusively valid when used to compare results from the same content and codec type [22]. Thus, although objective methods are in general more convenient to use, subjective methods are often needed nonetheless. Subjective QoE experiments provide factual assessments of users' experiences, and no matter how sophisticated objective assessment methods may be, they cannot capture every QoE attribute that may affect the experiences of users [56]. Multimedia in particular has become such an essential part of our everyday lives that QoE of multimedia content will be an especially important issue for the foreseeable future.
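For readers unfamiliar with the metric, the following sketch shows how PSNR could be computed between an original and a processed frame; it follows the standard definition PSNR = 10 log10(MAX^2 / MSE) and is our own illustration, not code from the thesis. The pixel values are hypothetical 8-bit luma samples.

```php
<?php
// Illustrative PSNR computation between an original and a processed frame,
// each given as a flat array of 8-bit luma samples (values are made up).
function psnr(array $original, array $processed): float
{
    $n = count($original);
    $mse = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $mse += ($original[$i] - $processed[$i]) ** 2;
    }
    $mse /= $n;
    if ($mse == 0.0) {
        return INF; // identical frames: no distortion at all
    }
    // PSNR = 10 * log10(MAX^2 / MSE), with MAX = 255 for 8-bit samples
    return 10 * log10((255 ** 2) / $mse);
}

echo psnr([52, 55, 61, 59], [52, 54, 60, 59]), " dB\n";
```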

Experience is, obviously, highly subjective. People with different cultural backgrounds, social and economic status, and personal experiences often react differently to similar experiences. For example, just changing some colours in an interface may change the effect it has on different people. Moreover, experience is context-dependent [33]. The same multimedia content may result in a different experience for the same person depending on the context. By context, we refer to the person's understanding of the situation or experience. This, however, can be difficult to identify and, although it may be taken into consideration, it is often not practical to try to measure.

Thus, to judge which particular experience is more pleasing, or preferred, we are inclined to use QoE methodologies and experiments to evaluate users' opinions, and thereby come to a conclusion on what is perceived as the better option by the majority of users. In Chapter 3, we will discuss some prominent methodologies often used to determine QoE, specifically within the field of multimedia assessment.

2.2 Multimedia

Multimedia is the simultaneous use of different types of media to effectively communicate ideas or knowledge, commonly accessed by an information content processing device. Multimedia includes a combination of audio, video, text, still images, animation, or interactivity formats. It may be either live or recorded, and is often divided into two categories, linear and non-linear. Linear multimedia content progresses without any navigational control by the user, like a cinema presentation, while non-linear content uses interactivity to control progress, such as with a video game [53].

The primary question we need to ask ourselves is: why do we evaluate multimedia? Alternative answers to this question might exist depending on the field of research, but commonly, and in the case of this thesis, the answer is fairly straightforward. We evaluate multimedia to examine if today's encoding and compression technologies are acceptable and, if not, how to improve them. Basically, to find out how much encoding and compression is optimal. What we mean by optimal can be highly subjective. Commonly, the technologies can be seen as optimal or acceptable when the perceived multimedia quality of the result is good enough for the purpose it is intended for. For example, a highly compressed, low-resolution video might look excellent on a mobile device, but not on a High Definition TV (HDTV). This generally means little or no presence of compression artifacts in the final outcome of the encoding. To understand the source of artifacts in digital video, consider the schema of a typical digital video processing system presented in Figure 2.1:

Figure 2.1: Digital video processing.

Digital video is captured, represented, processed, transmitted, and finally displayed as a sequence of still images, or frames, at a particular frame-rate. Each image consists of a rectangular array of rectangular shaped pixels, each containing colour and brightness information of a small region in a captured scene. Artifacts may occur in any frame, and different artifacts can be produced during each of the aforementioned steps. We are mainly interested in artifacts that occur in the encoding process, and to some extent the transmission and decoding processes, which we will come back to in Section 2.2.2.

2.2.1 Encoding and Compression

The advances in Internet services and applications, the rapid development of mobile communications, and the importance of video communications are all increasingly relevant these days. Users expect high-quality content with little or no delay, and limitations in networks and bandwidth therefore make it necessary to compress data in order to meet the users' expectations. Thus, encoding and compression are among the enabling technologies for many aspects of what can be called a multimedia revolution [46].

Encoding involves representing a piece of information in another form. For example, in hexadecimal encoding, one can represent 10 as 0xA. Compression, however, is done entirely to lessen the number of symbols (or bits) used to represent a given piece of information. This is achieved with the help of specific encoding of information. Different types of encoding give different levels of compression, but encoding does not always compress data. There are two main methods of compressing data: lossy and lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy, while lossy compression reduces bits by identifying unnecessary information and removing it. As the name indicates, no information is lost in lossless compression. Both methods are equally important, but are often used for different purposes. For example, lossless compression would be important in situations where we want the reconstruction to be identical to the original. However, in situations where this requirement is less important and more compression is desired, we can use lossy compression.
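As a toy illustration of lossless compression by removing statistical redundancy, the sketch below run-length encodes a string of repeated symbols. This is our own minimal example; real audio and video codecs are, of course, far more sophisticated.

```php
<?php
// Run-length encoding: a minimal lossless scheme that replaces runs of a
// repeated symbol with a count followed by the symbol itself.
function rle_encode(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; ) {
        $j = $i;
        while ($j < $n && $s[$j] === $s[$i]) {
            $j++; // extend the current run
        }
        $out .= ($j - $i) . $s[$i];
        $i = $j;
    }
    return $out;
}

echo rle_encode('AAAABBBCCD'); // prints "4A3B2C1D"; fully reversible,
                               // so no information is lost
```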

Figure 2.2: Compression and available bandwidth [12].

With the advances in technology and the expectations of the user that we mentioned initially, more and more data needs to be transferred over increasingly insufficient bandwidth. Using data compression, information can be shrunk into smaller sizes, which in turn enables larger amounts of data to be transferred at greater speeds. Moreover, compression is not only important in data communications. Today, companies and people in general store a massive amount of data on their computer systems. This data would take up an unnecessarily large amount of space, were it not for compression. This is especially the case when it comes to audio and video. Development of better transmission and storage technologies to handle these vast quantities of data is ongoing, but unfortunately it is not enough. Current studies show that mobile data traffic in particular is growing rapidly, almost doubling every year [24]. Figure 2.3 shows that video traffic alone accounts for over 60 percent of these numbers, and will continue to grow. At the same time, mobile users expect a high-quality video experience in terms of video quality, start-up time, reactivity to user interaction, and so on. Consequently, compression is highly important within this and many other aspects of today's multimedia generation.

Figure 2.3: Global mobile data traffic forecast, 2012-2017 [24].

2.2.2 Compression Artifacts

Multimedia is subject to various kinds of distortions during acquisition, compression, processing, transmission, and reproduction. These distortions are what is referred to as compression artifacts: distortions that human perception finds unnatural and that can consequently lessen the viewing experience [12]. Compression artifacts that are not related to data transmission are only present in lossy compression methods. Lossless compression does not discard any information, and therefore does not produce artifacts of this kind.

Commonly, the minimisation of perceivable artifacts is a key goal in implementing a lossy compression algorithm. Some of the most prominent artifacts that may reduce the perceived quality of multimedia sequences are summarised in the following sections.

Block Distortion

Block distortion (also known as blockiness or blocking artifacts) manifests itself as unnatural and easily perceptible blocks within an image. It is an image distortion defined by the inherent block encoding structure becoming visible [12]. It is often seen in compression methods that use block transformation coding, like the Discrete Cosine Transform (DCT), which group blocks of pixels together. As the block distortion in Figure 2.4 shows, edges are distinctly visible along the block structures.

Figure 2.4: Distinct blocking artifacts in an image [12].

Mosquito Noise

One specific artifact can be seen as ringing or other edge busyness in successive still images, which may appear in sequence as a shimmering blur of dots around edges. This is generally referred to as mosquito noise, as it resembles mosquitoes swarming around an object. Mosquito noise is most noticeable around artificial or computer-generated objects or lettering on a plain coloured background. Moreover, this effect is also visible around more natural shapes, like a human body. It occurs when reconstructing the image and approximating discarded data by inverting the transform model [12, 29].

Figure 2.5: Mosquito noise around edges of objects [12].

Quantisation Noise

Quantisation noise is defined in [29] as a "snow" or "salt and pepper" effect similar to a random noise process, but not uniform over the image. Consequently, it is fairly similar to mosquito noise, although it appears randomly across an image instead of particularly around edges.

Ringing Artifacts

Ringing artifacts are spurious ring-shaped visual echoes on sharp edges, echoes of hard edges, or oscillations or shimmering along the edges of an object against a relatively uniform background. This is commonly caused by coarse quantisation and loss of high frequency components in compression [54].


Blurriness

Blurriness is commonly defined as a global distortion over an entire image, characterised by reduced sharpness of edges and spatial detail. Reduction in sharpness of edges is often due to the attenuation of the high spatial frequencies [35]. Compression algorithms that trade off bits for code resolution and motion often cause this kind of artifact [29].

Jitter/Jerkiness

Jitter, or jerkiness, is originally smooth and continuous motion that is perceived as a series of distinct snapshots. It is the result of skipping video frames to reduce the amount of video information that the system is required to transmit or process per unit of time [29].

Auditory Artifacts

As with video, there are equally many artifacts pertaining to audio. Audio encoders are similarly complex in the way they process and compress audio data, and lossy audio compression may therefore result in a wide range of artifacts which reveal themselves as strange or unnatural noises. The two most common auditory artifacts are typically referred to as band-limited artifacts and birdie artifacts [39].

Packet Loss

In transmission of video or audio, the decoder might not receive all the encoded data because of loss or delay of data packets occurring in various layers of the underlying transmission network. In turn, this may produce unwanted artifacts during reconstruction. With the use of motion prediction in compression algorithms, a single packet loss can also affect many subsequent frames (motion-compensation artifact) [40]. Consequently, the resulting reconstruction of the compressed data may produce various errors over longer periods of time in a video sequence. These artifacts occur more often in situations where bandwidth is limited or the network is prone to errors. Because the fault occurs during the transfer of data, packet loss artifacts fall under transmission artifacts rather than compression artifacts.

Asynchrony

Another transmission artifact type is asynchronous artifacts: noticeable asynchrony between audio and video during playback. Although asynchrony is typically not perceptible until it exceeds 100 milliseconds out of sync either way [38], human subjects find synchrony issues especially annoying once they occur. Asynchrony may typically take place in situations where the audio and video streams are transmitted or processed separately, often with different delays.


2.3 Crowdsourcing

Crowdsourcing has emerged in recent years as a potential strategy for enlisting the general public to solve a wide variety of tasks. The term itself is a combination of the two words crowd and outsourcing, a neologism that means utilising the general public's wisdom rather than the expertise of employees or contractors [56, 7]. Consequently, crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, most often online using Internet crowdsourcing services. It combines the efforts of numerous self-identified volunteers or part-time workers, where each contributor, on their own initiative, adds a small portion to the greater result. Crowdsourcing is distinguished from outsourcing in that the work comes from an undefined public rather than being commissioned from a specific, named group.

Figure 2.6 shows a typical crowdsourcing value chain, where crowdsourcers seek workers via a facilitator, and in return receive a solution to the problem at hand.

Figure 2.6: A common crowdsourcing value chain.

2.3.1 Typology

Crowdsourcing can commonly be divided into different types, depending on what problem is to be solved. For example, Daren C. Brabham has put forward a problem-based typology of crowdsourcing approaches [4]:

• Knowledge Discovery and Management - for information management problems where an organisation mobilises a crowd to find and assemble information.

• Distributed Human Intelligence Tasking - for information management problems where an organisation has a set of information in hand and mobilises a crowd to process or analyse the information.


• Broadcast Search - for ideation problems where an organisation mobilises a crowd to come up with a solution to a problem that has an objective, provable right answer.

• Peer-Vetted Creative Production - for ideation problems where an organisation mobilises a crowd to come up with a solution to a problem which has an answer that is subjective or dependent on public support.

Online QoE assessment studies seem to fall under the distributed human intelligence tasking approach, as we have a set of information in the form of multimedia and we wish to mobilise a crowd to assess, or analyse, this information.

Additionally, several categories of crowdsourcing have been identified to define ways in which people use crowds to perform tasks [21]. These include, although are not limited to, crowdvoting, crowdfunding, microwork, wisdom of the crowd, creative crowdsourcing, and inducement prize contests. For our project, microwork is the most relevant category to bring to attention. Microwork is a platform on which users do small tasks for which computers lack aptitude, generally for a small payment. Amazon's Mechanical Turk [23] may be the most popular service for this type of crowdsourcing, and would be a recommended service for gathering participants for research using MAT.

2.3.2 Benefits

Online surveys are possibly the most popular application of the crowdsourcing strategy for user studies [7], and there are several benefits of using crowdsourcing to gather results compared to the traditional laboratory setting. Most significantly, they are more efficient in terms of time and monetary cost, since it is relatively easy to collect responses from a large number of people within a short time-frame online. The lower price in particular, compared to the cost of hiring professionals or recruiting the general public offline, draws researchers towards the crowdsourcing paradigm. Moreover, the high number of people who are ready to work for you at any time is commonly greatly beneficial. Online surveys may also be favourable for the participants, as they can respond at their own convenience. Having this option might also make participants more willing to complete the questionnaires. Additionally, online surveys do not have the "interviewer effect", where the interviewer may influence how participants answer the questions [7, 55].

While our assessment tool supports invitation of specific people for specific experiments, for example a group of experts for research within a specified field, crowdsourcing is obviously a highly valuable option for gathering participants due to the low costs and the reduced time and effort constraints it facilitates. Mass-creation of participant accounts and invitation of participants signed up through crowdsourcing services are features that are implemented in MAT.


2.3.3 Challenges

The main issue with crowdsourcing is the trustworthiness of the general Internet user. Not everybody is trustworthy, unfortunately. Here we refer mainly to how the user accomplishes the crowdsourced task, and whether the result is of the expected quality or not. Since crowd-workers completing tasks are paid per task, there is often a financial incentive to complete tasks quickly rather than well. Within many applications of crowdsourcing, verifying responses may be time-consuming, so having multiple workers complete the same task is often needed to correct discrepancies. However, having tasks completed multiple times increases both time and monetary costs. Consequently, an interesting effect emerges: we trust our assessors less, and it is harder to exclude outliers. Some ethical issues also arise within crowdsourcing, which we will come back to in Section 3.2.

When it comes to user studies conducted using online surveys, as with our project, methods may be built into applications to verify responses quickly and inexpensively. To counteract the false results that may occur due to untrustworthy participants, surveys often include multiple questions designed to tap into the same topic. This way, the correlation between related items gives an indication of the consistency of the responses. The correlation between related questions serves as a measure of inter-item reliability. This is what is often referred to as consistency checking. Moreover, within the field of quality assessment in particular, the exact same question (or test item) may be displayed more than once within the course of a survey. By using this kind of repetition, the consistency of the participants' responses can be further checked, and low correlation between identical questions is easily detectable. In some cases, so-called lie detector questions are also used. These are questions people will tend to lie on if they wish to present themselves in a better light. However, this type of question is seldom used in quality assessment studies.
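The sketch below illustrates one way such a consistency check could be implemented: the Pearson correlation between a participant's first and repeated ratings of the same items is computed, and a low coefficient flags the participant for review. The ratings and the 0.5 threshold are assumptions made for the example, not values prescribed by MAT.

```php
<?php
// Consistency checking via repetition: correlate a participant's first and
// second ratings of the same test items (Pearson's r).
function pearson(array $x, array $y): float
{
    $n = count($x);
    $mx = array_sum($x) / $n;
    $my = array_sum($y) / $n;
    $cov = $sx = $sy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $cov += ($x[$i] - $mx) * ($y[$i] - $my);
        $sx  += ($x[$i] - $mx) ** 2;
        $sy  += ($y[$i] - $my) ** 2;
    }
    return $cov / sqrt($sx * $sy);
}

// Hypothetical ratings of five items, each shown twice to one participant
$first  = [4, 2, 5, 3, 4];
$repeat = [4, 3, 5, 3, 5];

if (pearson($first, $repeat) < 0.5) { // threshold chosen for illustration
    echo "Flag participant for inconsistent responses\n";
}
```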

In traditional laboratory experiments, participants normally view multimedia content in a controlled environment that equalises experiment conditions. In crowdsourced experiments however, participants often view the content under varied conditions, such as different screen sizes, surrounding lighting and various equipment qualities. This may be a disadvantage if the goal is to measure the quality of multimedia content in a specific scenario. On the other hand, it can be considered an advantage because the users’ perceptions can then be assessed in real-life scenarios.

Demographic factors also play an important role in the challenges of crowdsourcing. Some QoE assessment studies rely on a specific demographic make-up of participants [55]. However, crowdsourcing makes it difficult, if not impossible, to relate the assessment results to demographic factors, such as gender, age or location. Identifying each crowd worker, for example by asking questions about demographic elements, is unlikely to be effective, as this data may not be trustworthy. Moreover, researchers cannot use sampling techniques to select candidate respondents, as they commonly do for face-to-face surveys [7, 55].


2.4 Existing Frameworks

Quality assessment has been an important topic for several decades within both business and research, and with the help of the Internet it has become an even more prominent factor. There is a large amount of software and common frameworks on the market today for running assessment surveys and studies. A question therefore arises: why do we aim to develop yet another one? By looking at some of the alternatives out there, we reveal that there are very few tools that replicate the proposed features of MAT, and those that do are limited to a single evaluation method.

2.4.1 Online Assessment Tools

Online assessment tools, or survey applications as they are commonly referred to, are not hard to find on the Internet today. These are simple yet comprehensive tools with which experimenters can run many kinds of assessment studies. The participants commonly conduct these surveys by answering questions, one by one, for example by typing in an answer, selecting one of multiple choices, or selecting an item on a scale. A majority of the population connected to the Internet today has been through a survey of this kind, for example for evaluating a local supermarket or for measuring one's experience with a particular product or website.

Except for the presentation of multimedia sequences, our assessment tool greatly resembles an ordinary online survey application. This may suggest that we could just use one of these tools for our purpose; however, quite a few issues arise in that case. Firstly, these tools tend to be very extensive [42, 55]. This may be viewed as a positive factor, but in many cases these extra features are not necessary and in the worst cases they are simply in the way. Building a framework from scratch, tailored to our specific needs, may be just as good if not better than some commercial tools. Secondly, these assessment tools are seldom free [42, 55]. For small studies or surveys, some software offers free, limited trials for commercial and private use, but as soon as one might want to scale things up, or want a few more features, the price increases accordingly. Most importantly, none of the tools we could find are specifically made for the presentation and assessment of multimedia content. Without support for this content, together with the lack of the underlying methodology of QoE assessment of multimedia, it would be difficult to run any kind of proper assessment study of this kind.

2.4.2 Specific Frameworks for Assessment of Multimedia

From what we have been able to find, there are actually surprisingly few frameworks available that specifically focus on QoE assessment of multimedia content using crowdsourcing or online communications in general. One that stands out is a web-based platform facilitating QoE assessment of multimedia, called Quadrant of Euphoria [6, 7, 56].

This framework enables experimenters to create user studies on quality assessment of images, audio clips and video sequences, and offers the possibility to gather participants using the previously mentioned paradigm of crowdsourcing. The procedure of conducting an experiment is based on the subjective quality assessment method Pair Comparison (PC) [30], and consists of the participant being presented with several pairs of media items under scrutiny. These pairs are presented interchangeably on the screen, and the participant may switch between them using the space-bar, as seen in Figure 2.7. The participant is then tasked with choosing which of the presented items he or she prefers, based on the perceived quality of the items. This gives the experimenter results on which items (and thereby systems) are generally more accepted than others.

Figure 2.7: Quadrant of Euphoria’s experiment interface under both space- bar states [6].

Quadrant of Euphoria is thus similar in many ways to what we are developing in this project. However, some outstanding issues still make it desirable to design and implement our own unique and different assessment tool. First of all, Quadrant of Euphoria only supports studies using the PC method. While PC is a strong method and often used in this kind of research, we aim to provide support for a number of other methodologies as well. These methods, together with PC, will be discussed further in Section 3.1. Secondly, even if we were to get permission to further develop this tool, Quadrant of Euphoria seems to be partially developed in ActionScript for use with the popular browser plug-in Adobe Flash Player [26, 6]. In addition to our lack of previous experience with this development platform, Flash is a third-party application which, unfortunately, not everybody has installed. Moreover, with the entrance of HTML 5 onto the market, Flash usage is in overall decline worldwide [49], which indicates that it may not be supported forever.

2.5 Summary

Quality assessment has been an important topic in many areas of research for the past several decades. In this chapter, we have presented Quality of Experience (QoE) as an important form of quality evaluation, which commonly measures users' own individual opinions or subjective experiences. QoE may be measured using both subjective and objective measures; however, subjective methods often provide more factual assessments of users' experiences. Furthermore, quality assessment of multimedia content in particular is becoming more relevant, as the usage of multimedia in our everyday lives is increasing rapidly.

Multimedia is the simultaneous use of different media to effectively communicate ideas or knowledge, and is used in a majority of aspects regarding, for instance, entertainment and communication. The reason we are interested in evaluating multimedia content is typically to examine whether current encoding and compression technologies are sufficient or optimal. However, this is highly subjective, depending for example on users' preferences and the purpose the content is intended for. Moreover, encoding and compression are necessary in order to transmit and store data efficiently and quickly, over increasingly insufficient bandwidth and storage. Multimedia content commonly consists of a large amount of data, making compression even more essential. Unfortunately, multimedia is subject to various types of distortions or artifacts during not only compression, but also transmission and other processing. These artifacts may compromise the quality of the content, thus lessening the user experience.

This chapter has also proposed crowdsourcing as a promising method of reaching a more diverse crowd of participants for online multimedia assessment studies. Crowdsourcing includes benefits such as time-efficiency and low monetary costs. The high number of people ready to work at any time, with the possibility of working from anywhere at any time, contributes to the popularity and usability of the crowdsourcing paradigm. However, some challenges present themselves with this method. The main issue is the trustworthiness of the typical Internet user. Environmental control and demographic factors may also play a small part within the challenges of crowdsourcing.

Furthermore, we have had a look at existing online assessment tools, pointing out the issues or shortcomings these present in the task of assessing multimedia content. Common online survey tools typically lack support for the presentation and assessment methodology of multimedia, as well as generally being quite expensive. The web-based multimedia assessment framework Quadrant of Euphoria, however, is one that is similar to MAT in many ways. We are nevertheless aiming to build a more flexible tool, thus giving experimenters more freedom to design experiments according to their needs.


Chapter 3

Subjective Evaluation

As outlined, multimedia quality is typically approached by one of two methods: QoS, which assesses quality based on objective measures, and QoE, which considers the subjective opinions of assessors. In general, subjective quality assessment has no pre-established measure or standard and is thus based solely on the opinion of the evaluator, although some methods can use a point of reference to judge differences.

Multimedia quality assessment relies heavily on this type of subjective evaluation to gather data on the perceived quality of experience of human observers. Subjective assessment is useful for measuring end-user acceptance, comparing alternative algorithms and finding optimal designs or configurations when it comes to encoding and compression of multimedia content.

3.1 Methodology

Several test methods for subjective quality assessment have already been researched and extensively used for many years. International recommendations, such as ITU-R Rec. BT.500 [27], ITU-T Rec. P.910 [31] and ITU-T Rec. P.911 [30], provide us with outlines of the most prominent ones. The recommendations provide instructions on how to perform these tests for the assessment of video and/or audio quality in a controlled laboratory environment. Although our assessment application typically does not run in a controlled environment like these recommendations describe, the outlined stringency with which to run user studies remains highly relevant. The recommended test methods are commonly known as Absolute Category Rating (ACR), Degradation Category Rating (DCR), Pair Comparison (PC) and Single Stimulus Continuous Quality Evaluation (SSCQE). Common to them all is the showing of multimedia sequences to a group of viewers, whose opinions are recorded and averaged to evaluate the quality of each audiovisual sequence. While the premises vary between tests, their outcomes contribute mean scores for a range of quality implementations.


3.1.1 Absolute Category Rating

Absolute Category Rating (ACR), also known as the Single Stimulus (SS) method [27], is a category judgement where the test sequences are presented one at a time and are rated independently on a category (rating) scale [30, 31, 28]. The recommendations specify that after each clip, subjects are asked to evaluate the quality of the sequence presented. The presentation time may vary according to the content that is being evaluated, but the voting time should be limited to 10 seconds or less, depending on the voting mechanism used. A five point rating scale, as seen in Table 3.1, should be used. However, if a higher discriminative power is required, a larger scale may be used.

5 Excellent
4 Good
3 Fair
2 Poor
1 Bad

Table 3.1: ACR’s recommended five point rating scale.

3.1.2 Degradation Category Rating

Degradation Category Rating (DCR), also known as the Double Stimulus Impairment Scale (DSIS) method [27], is a test method in which test sequences are presented in pairs; the first stimulus in each pair is always the source reference, while the second stimulus is the same source presented through one of the systems under test [30, 31, 28]. In this case, the subjects are asked to rate the impairment of the second stimulus in relation to the reference. The total presentation and voting times are recommended to be the same as in ACR. A five point scale is likewise used here, but the wording should represent a rating of impairment, as presented in Table 3.2.

5 Imperceptible
4 Perceptible but not annoying
3 Slightly annoying
2 Annoying
1 Very annoying

Table 3.2: DCR’s recommended five point impairment scale.


3.1.3 Pair Comparison

The Pair Comparison (PC) method implies that the test sequences are presented in pairs, consisting of the same sequence being presented first through one system under test and then through another system [30, 31, 56, 7, 13]. Moreover, the source sequence may be included, in which case it is treated as an additional system under test. Commonly, the systems under test are combined in all possible n(n - 1) combinations (AB, BA, CA etc.), so as to test each against all others, in both possible orders. After the presentation of each pair, the subject is asked to choose which sequence is preferred. The voting time is similar to the previous methods, though the presentation time is recommended to be about 10 seconds.

In the case of a large number of systems being tested, a huge number of sequence pairs could be constructed if every possible n(n - 1) combination were to be run. Eichhorn et al. [13] present a possible solution to this problem, named randomised pair comparison (R/PC). Using this method, each user is presented with a randomised subset of all combinations, thereby reducing the subject's time and effort significantly. However, while showing results similar to full experiments run with all pair combinations, this method requires a larger number of participants. Additionally, the statistics in the results may become less conclusive. The importance of statistics will be explained further in Section 3.3.
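To make the n(n - 1) pair construction and the R/PC subset concrete, here is a small sketch of how the pairs could be generated; the system names and the subset size are illustrative assumptions, not values from the cited work.

```php
<?php
// All n(n - 1) ordered pairs of systems under test: AB and BA count as
// distinct presentations, so each system meets every other in both orders.
$systems = ['A', 'B', 'C', 'D'];

$pairs = [];
foreach ($systems as $s1) {
    foreach ($systems as $s2) {
        if ($s1 !== $s2) {
            $pairs[] = [$s1, $s2];
        }
    }
}
// count($pairs) is n(n - 1) = 12 for four systems

// R/PC: show each participant only a random subset of all combinations
shuffle($pairs);
$subset = array_slice($pairs, 0, 6); // subset size chosen for illustration
print_r($subset);
```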

3.1.4 Single Stimulus Continuous Quality Evaluation

Single Stimulus Continuous Quality Evaluation (SSCQE) is a test method that evaluates long-duration multimedia sequences, typically from 3 to 30 minutes. Subjects perform continuous subjective quality assessment, without any reference, by means of a moving slider while watching and/or listening to a sequence [27, 30]. The results may be presented by plotting curves which indicate the percentage of time during which the subjective score was higher than a given score on a 0-100 scale. The method is consequently well suited to taking into account temporal variations of quality and to making global quality assessments. The drawback, however, comes with having no reference, making it less suited for tests which require a high degree of discrimination [30].
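The following sketch shows how such a percentage-of-time summary could be computed from continuous slider samples; the sample values and thresholds are hypothetical.

```php
<?php
// SSCQE summary: percentage of time the continuous 0-100 score exceeded
// each threshold, given one slider sample per time unit (values made up).
$samples = [72, 65, 80, 55, 90, 40, 77, 68];

foreach ([25, 50, 75] as $threshold) {
    $above = count(array_filter($samples, fn($s) => $s > $threshold));
    printf("score > %d for %.1f%% of the time\n",
           $threshold, 100 * $above / count($samples));
}
```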

3.1.5 Quality Evaluation of Long Duration Audiovisual Content

In a recent paper, Borowiak et al. [3] present a method for multi-modal, long-term quality assessment of audiovisual content. This method, hereafter referred to as QELDAC for short, differs from the previously mentioned methods in that it is based on an adjustment of the quality during playback. Assessors adjust the quality to a desired level when degradations occur, as opposed to giving a specific score, which the other methodologies are based on. This eliminates the need for translating the perceived quality into a single number, which allows the subjects to focus on the content instead of directing their attention to the assessment task itself. Moreover, the research can focus more on the subjects' expectations and reactions to quality changes over longer periods of time.

3.1.6 Comparison

Each test method has its own set of advantages, and choosing which methodology to use for an assessment study may not be as straightforward as one might imagine. An important issue in choosing a test method is the fundamental difference between methods that use explicit references (e.g. DCR) and methods that do not use any explicit reference (e.g. ACR, PC and SSCQE) [30, 31]. The latter do not test fidelity with regard to a source sequence, which is often important in the evaluation of high quality systems [30]. In this case, when the viewer's detection of impairment is an important factor, the DCR method is recommended. ACR is simple and fast to implement, and the presentation of the stimuli is similar to that of the common use of the systems under test. Thus, ACR is well suited for qualification tests [30, 31]. The PC test method takes advantage of the simple comparative judgement task in which to prioritise a set of stimuli. Because of its high discriminatory power, it is particularly valuable when several of the test items are nearly equal in quality [30].

Moreover, the larger the number of test items, the more time-consuming this procedure becomes, which may be an inconvenience in some cases. The methodologies that consider long-duration sequences (e.g. SSCQE and QELDAC) are obviously better suited in situations where sequences of a longer duration need to be assessed. The two methods vary slightly in the form of their final outcome. SSCQE would be used when the preferred outcome is a score based on the perceived quality at certain intervals throughout the test item [30]. However, QELDAC may be better in situations where the researcher would like to know what quality level is acceptable to a potential user [3].

Considering the advantages and weaknesses associated with the different assessment methods, the appropriateness of each will depend on the planned experiment. Experimenters have varying needs, and they require the freedom to run assessment studies according to their particular needs or preferences. MAT aims to offer a large range of options and specifications for methodologies and presentation modes. The main limitation lies with what is feasible to implement within the parameters of this project. With enough time and effort, all of the outlined methodologies could be implemented without any apparent challenges, even to the exact specifications of the ITU [30]. However, methods designed to assess long-duration sequences may require adjustments to the current structure of the software. This will be discussed in more detail in the design and implementation chapter. Furthermore, MAT's user interface makes it a convenient tool for experimenters of all levels of computer skills. The software can also manage large quantities of data, both on the input and the output side. Studies on multimedia quality demand that the experiment tool can handle presentation sequences of multiple audio and/or video files. In addition, response data from dozens of participating assessors need to be collected, analysed, and reported. MAT is designed to handle large and numerous data files without interrupting the flow of experiment planning and running.

3.2 Ethical Considerations

When conducting research and administering online assessment studies, there are several ethical dilemmas that may need to be taken into consideration. We will briefly discuss a few of these that are relevant to the topic at hand.

Reward and Money Psychologists have found that giving rewards in the form of money or other goods commonly reduces the motivation of a participant [19]. Whether this applies to online assessment studies as well may need further research, but it is something that experimenters should keep in mind. An additional ethical dilemma concerning money or rewards is whether they might influence the outcome of the results. By rewarding participants, it is possible to "pay for the right answer", thus altering the natural outcome of an experiment. Although voluntary participants may also be persuaded to answer falsely, they generally have little or no motivation to do so. However, this phenomenon is unlikely to occur in QoE experiments, as researchers are ordinarily interested in finding individual preferences.

Crowdsourcing and Wages Recently, researchers have argued that the wage conditions within crowdsourcing may be unethical [44]. Crowdworkers are not guaranteed a minimum wage, because they are considered independent contractors and not employees. Moreover, no written contracts, non-disclosure agreements, or employee agreements are typically made with crowdsourced workers. This gives the requesters the final say over whether users’ work is acceptable, and whether or not they will be paid. Although crowdsourcing may be viewed as slightly unethical towards the worker, the cheap labour is one of the main reasons it has become so popular.

Representativeness/Sampling Participants are essential to any experiment, and as such they need to be objectively taken into consideration when running experiments. First of all, experimenters commonly need to know if a participant is representative, meaning that the participant is within the demographic sample of people that the experiment is aimed towards. However, since determining the demographic properties of participants is such a challenge when using crowdsourcing, this may be hard to achieve [55, 56]. Consequently, when representativeness is important, the experiment should be directed towards a known group of participants rather than an unknown crowd. Conversely, when participant diversity is an accepted or desired property of the experiment, crowdsourcing may be especially suitable.

Fatigue Participant fatigue is a common outcome when similar tasks are performed repeatedly over long periods of time. When participants become fatigued or bored from repeating the same task over and over, as in typical assessment studies, responses may become less accurate. However, since MAT is aimed at being a flexible evaluation tool, the experimenter has control over repetitions and over how long any experiment will be, thus reducing the potential for fatigue. Additionally, unlike in common laboratory experiments, participants using MAT have the unique advantage of being able to take a break whenever they feel it necessary.

Anonymity and Confidentiality Anonymity and confidentiality are two significant topics when dealing with any kind of human population. Anonymity refers to concealing the identities of participants in all documents resulting from the research, while confidentiality is concerned with who has the right of access to the data provided by the participants [2]. People tend to prefer remaining anonymous when participating in online assessment surveys, especially when semi-sensitive information is requested. Associating results from assessment studies with specific people is commonly neither necessary nor useful, and anonymity is therefore a common feature of most assessment tools. Consequently, our tool provides full anonymity for participants conducting experiments, and all data should be kept confidential.

Data Collection Assessment tools collect data from participants in order to produce results on the topic of the experiment. Data collection is a particularly important topic when it comes to both law and ethics. Not only are there restrictions on what kind of information can be collected and for what purpose, but also on how this data can be used or published subsequently. Moreover, any sensitive data collected must be stored securely so that it cannot be accessed by unauthorised groups or individuals. Information on data collection and general disclaimers is therefore often included at the start of online surveys, to inform the participants of the purpose of the study and assure them that the data will be in secure hands. In some cases, participants are asked to give consent to the planned use of the collected data, which often coincides with the possibility to withdraw from the experiment or study at any time.

Information on general data collection and anonymity is included in experiments created using MAT. However, since MAT allows experimenters to design several different experiments for different purposes, specific information or disclaimers on this topic should be added by the experimenters themselves to the text presented prior to experiment commencement.


3.3 Statistics

When dealing with any kind of assessment study, statistics play an essential part. Statistics is the study of the collection, organisation, analysis, interpretation and presentation of data. Accordingly, assessment studies collect and organise data from their experiments for further analysis, interpretation and, eventually, presentation. Analysis of the data is needed to provide evidence for what is being researched in the experiment. A typical approach, for example, is to average the user scores for each test condition into a mean opinion score (MOS), which is just one of many statistical procedures.
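
As a sketch, let s_{ij} denote the score given by subject i to test condition j, and let N be the number of subjects. The MOS for that condition is then the arithmetic mean

\[ \mathrm{MOS}_j = \frac{1}{N} \sum_{i=1}^{N} s_{ij} \]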

Although the MOS is simple to compute and use, it cannot by itself express the confidence of a result. Thus, even a well designed experiment presented only by its MOS ratings may be rejected by a good publication. It is therefore recommended that researchers perform additional statistical analysis on the results of their experiments. For example, analysis of variance (ANOVA) is a parametric procedure that allows researchers to compute statistical outcomes that include a confidence analysis.
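
One common way to express this confidence is the standard Student-t interval, shown here as a general sketch rather than as MAT’s built-in output. A 95% confidence interval around each MOS is

\[ \mathrm{MOS}_j \pm t_{0.975,\,N-1} \, \frac{\sigma_j}{\sqrt{N}} \]

where \sigma_j is the sample standard deviation of the scores for condition j. For large N, the t-quantile approaches 1.96.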

In the results and statistics section of experiments conducted using our assessment tool, the results of a few basic statistical procedures are presented together with the experimentation data. This includes percentages, the mean opinion score and the standard deviation, alongside the raw data. The data is also available in a downloadable spreadsheet format, so that further analysis can be done by the researcher. More advanced statistical procedures, like ANOVA, may be implemented in future work.
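
To illustrate, the following is a minimal PHP sketch of how such basic statistics can be computed server-side. The function name and the example score array are hypothetical and do not reflect MAT’s actual code.

<?php
// Compute the mean opinion score (MOS) and the sample standard
// deviation for one test condition. $scores is a hypothetical array
// of integer ratings (e.g. 1-5) collected from the participants.
function computeStats($scores) {
    $n = count($scores);
    if ($n === 0) {
        return null; // no ratings collected yet
    }
    $mos = array_sum($scores) / $n;
    $sumSq = 0.0;
    foreach ($scores as $s) {
        $d = $s - $mos;
        $sumSq += $d * $d;
    }
    // Sample standard deviation uses n - 1 in the denominator.
    $sd = ($n > 1) ? sqrt($sumSq / ($n - 1)) : 0.0;
    return array('n' => $n, 'mos' => $mos, 'sd' => $sd);
}

print_r(computeStats(array(4, 5, 3, 4, 4)));
?>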

3.4 Summary

Subjective evaluation of multimedia content is typically approached using a set of well documented and tested methodologies. In this chapter, we have discussed some of the prominent methods of quality assessment outlined by the International Telecommunication Union (ITU).

Furthermore, a comparison of these methods has been presented. The methods each have their own set of advantages, and are commonly used in slightly different settings. The DCR test method, for example, uses an explicit reference for the detection of impairments, which is often important in the evaluation of high quality systems. ACR, on the other hand, presents sequences one at a time, and users rate each clip on a rating scale, similar to the MOS test method mentioned in the introduction chapter. The PC method displays sequences in pairs, taking advantage of the simple comparative judgement task of prioritising one stimulus over another. Finally, the SSCQE and QELDAC methods consider long-duration sequences, and are used in situations where evaluation of long-duration multimedia is necessary. Because experimenters often require the freedom to run assessment studies according to their particular needs and preferences, we aim to provide a large range of options and specifications for these methodologies and presentation modes in MAT.

Ethical and practical considerations are additional topics of importance with regard to online assessment studies. Several matters for consideration are discussed in this chapter. For example, giving rewards in the form of money or other goods for participation may alter the outcome of the research to some extent. Anonymity and confidentiality regarding the participants are also important factors to consider. This implies allowing participants to remain anonymous and ensuring that data collection is handled properly and securely, while keeping the data confidential.

Furthermore, we note that statistical analysis of experiment results is necessary in any assessment study. Analysis is needed to provide evidence for what is being researched, and as such we have briefly described which statistical procedures are included and presented in MAT. Moreover, the raw data is easily downloadable for further analysis and interpretation.


Chapter 4

Technologies & Frameworks

Throughout the development of our assessment tool, we have relied on a body of well-tested and popular technologies, frameworks and programming languages. Together, these form the foundation for the design and implementation of the MAT software. In order to provide insight into how the application is built and how it works, this chapter outlines the technological background for its development.

4.1 Server-side

The LAMP software bundle is a set of free, open source software that is commonly used to build a viable general purpose web server [52]. The acronym LAMP refers to the four technologies used in this bundle: Linux (operating system), Apache (HTTP server), MySQL (database software), and either PHP, Perl or Python (scripting). The exact combination may vary, especially with the choice of scripting software, but also when it comes to the operating system. Other operating system combinations include Microsoft Windows (WAMP), Mac OS (MAMP), Solaris (SAMP), iSeries (iAMP), and OpenBSD (OAMP). Some less common variants incorporate an alternative web server, like Microsoft’s Internet Information Services (WIMP), or different database software, like PostgreSQL (LAPP). Figure 4.1 shows a graphical representation of the general LAMP architecture and illustrates how the components interact with each other. Each component is described in detail in the following sections.

The primary reason for the popularity of the LAMP combination is that the software is free of cost and open source, which makes it easily adaptable. All the components come bundled with most current Linux distributions, which greatly improves the ease of use. Our tool uses the standard LAMP configuration as the web application server, with PHP as the chosen programming language. This setup is simple but powerful, and covers all our requirements.


Figure 4.1: The LAMP software architecture.

4.1.1 Linux

Linux is the Operating System (OS) on which all the remaining technologies of the LAMP stack run. Originally developed in the early 1990s as a port of UNIX to the Intel x86 processor [36], it has become one of the most commonly used operating systems. Moreover, Linux and its variants are the most popular operating systems for servers and other larger systems [41]. The development of Linux is one of the most prominent examples of free and open source software collaboration, contributing to its wide-spread use. Consequently, several different Linux distributions have been developed over the past two decades; the more prominent ones include Debian, Red Hat, (open)SUSE and Mandriva. MAT is designed to run on a Debian system, but is fully supported on any other Linux system with the LAMP software stack.

4.1.2 Apache

The Apache Hypertext Transfer Protocol (HTTP) Server is a highly efficient, secure and extensible web server that provides HTTP services in sync with the current HTTP standards [18]. The Apache web server is developed and maintained by an open community under the Apache Software Foundation, and has been the most popular web server since 1996 [18, 41]. Consequently, it played a key role in the initial growth of the World Wide Web, and was in 2009 the first web server software to surpass the milestone of 100 million websites [41].

The primary function of a web server is to deliver web pages requested by clients. This service is achieved through the Hypertext Transfer Protocol (HTTP) [15], which is the foundation of data communication over the World Wide Web. Commonly, the server will deliver HTML documents and any additional content referenced by the document, such as images, style sheets and scripts. If unable to deliver, the server will respond with an error message. In addition to serving content to the client, HTTP includes methods for receiving content from clients. This feature is commonly used for submitting web forms and uploading files to the server. More importantly, web servers such as Apache support server-side scripting using PHP and a number of other scripting languages. This function can be used to create dynamic web pages, rather than simply returning static HTML pages from the server’s secondary storage (see Figure 4.2).

Furthermore, server-side scripting adds the possibility to retrieve and modify large amounts of data from databases. This is key for developing highly dynamic applications like the Multimedia Assessment Tool.

Figure 4.2: Request and delivery of static content.
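
As a minimal illustration of such a dynamic page (the file name and request parameter are hypothetical, not part of MAT), a PHP script served by Apache generates its HTML at request time instead of returning a static file:

<?php
// results.php -- a hypothetical dynamic page. The HTML below is
// generated anew for every request, so its content can vary with
// the request parameters and the current server state.
header('Content-Type: text/html; charset=utf-8');
$name = isset($_GET['name']) ? htmlspecialchars($_GET['name']) : 'participant';
echo '<html><body>';
echo '<p>Welcome, ' . $name . '. The server time is ' . date('H:i:s') . '.</p>';
echo '</body></html>';
?>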

4.1.3 MySQL

The most widely used open source relational database management system (RDBMS) in the world, as of May 2013, is MySQL [10]. It is an especially popular database for use in web applications, mostly due to its place in the LAMP software stack. Moreover, high connectivity, speed and security make it well suited for accessing databases over the Internet. The MySQL database software runs as a standalone server, providing multi-user access to any number of databases. In short, a database is a structured collection of data, and to access and manipulate the data stored in a database, a DBMS [11] like MySQL is required. DBMSs are used in a vast number of applications, because they typically provide the most efficient way of storing large amounts of structured data. The "SQL" part of the name stands for Structured Query Language, the most common standardised special-purpose programming language used to access databases [11]. The language is designed specifically for managing data stored in a relational DBMS.

Relational database systems, such as MySQL, store data in tables, as collections of rows and columns. In addition, these systems are responsible for providing relational operators to manipulate the data in tabular form.
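
As a brief sketch of how PHP typically queries such tables (the database name, credentials and the ratings table are hypothetical placeholders, not MAT’s actual schema):

<?php
// Connect to a hypothetical MySQL database using the mysqli extension.
$db = new mysqli('localhost', 'mat_user', 'secret', 'mat');
if ($db->connect_errno) {
    die('Connection failed: ' . $db->connect_error);
}
// Let the database itself compute the average score per test condition.
$result = $db->query(
    'SELECT condition_id, AVG(score) AS mos, COUNT(*) AS n
       FROM ratings GROUP BY condition_id'
);
while ($row = $result->fetch_assoc()) {
    printf("condition %s: MOS %.2f (n = %d)\n",
           $row['condition_id'], $row['mos'], $row['n']);
}
$db->close();
?>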
